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CONVERSION  FACTORS,  NON-SI  TO  SI  (METRIC) 
UNITS  OF  MEASUREMENT 


Non-SI  units  of  measurement  used 
(metric)  units  as  follows: 

Multipl 


feet 

inches 

inch-pounds  (mass) 
kips  (1000  lb  mass) 
pound  (mass) -square  inches 
pounds  (force)  per  square  foot 
pounds  (force)  per  square  inch 
pounds  (mass) 

pounds  (mass)  per  cubic  foot 

pounds  (mass)  per  cubic  inch 

pounds  (mass)  per  inch 


in  this  report  can  be  converted  to  SI 


0.3048 

0.0254 

0.01152125 

453.59237 

0.00029264 

47.88026 

6.894757 

0.45359237 

16.01846 

27,679.905 

17.85797 


To  Obtain 


meters 

meters 

kilogram-meters 

kilograms 

kilogram-square  meters 

pascals 

kilopascals 

kilograms 

kilograms  per  cubic  meter 
kilograms  per  cubic  meter 
kilograms  per  meter 


ACCURACY  CONSIDERATIONS  WHEN  USING  SOME  MINICOMPUTERS 
FOR  SCIENTIFIC  AND  ENGINEERING  PROBLEMS 


PART  I :  INTRODUCTION 

Background 

1.  Floating-point  (FP)  computations  are  routinely  used  in  programming 
for  scientific  and  engineering  (S&E)  applications.  Novice  computer  users  tend 
to  implicitly  trust  the  computer  to  produce  correct  answers  and  might  not 
question  the  accuracy  of  results  to  as  many  significant  digits  as  might  be 
printed.  Experienced  users  tend  to  become  disillusioned  after  experiences 
with  erroneous  answers  and  eventually  wary  of  FP  computations  that  produce 
believable  answers  that  are  nevertheless  wrong.  Indeed,  FP  computation  is 
inherently  inexact  and  can  easily  be  inadvertently  misused.  Experience  in 
the  Corps  of  Engineers  has  shown  that  FP  processing  on  some  computers  is  too 
imprecise  for  many  common  S&E  applications. 

2.  One  approach  to  dealing  with  this  problem  is  to  require  that  all 
real  variables  be  double-precision,  but  this  exacts  penalties  in  main  memory 
required  and  often  in  execution  time.  The  penalties  frequently  are  so  severe 
that  this  approach  is  impractical  for  many  applications.  Therefore,  the 
tendency  is  to  process  precision-sensitive  applications  on  long-word-length 
machines;  e.g.,  large-scale  CDC  systems  with  60-bit  words.  In  some  cases, 
careful  analysis  of  a  program  might  indicate  that  acceptable  accuracy  is  at¬ 
tainable  with  shorter  word  lengths  if  minor  program  modifications  are  made; 
e.g.,  double-precision  calculations  only  in  a  few  critical  areas.  While  the 
modifications  may  be  minor,  the  analysis  often  is  not  and  might  involve  ex¬ 
penditure  of  substantial  human  and  machine  resources.  Such  analyses  are 
rarely  pursued  by  the  engineer  programmer  since  his  expertise  is  usually  not 
in  numerical  analysis.  From  his  perspective,  the  easiest  and  most  cost- 
effective  solution  has  often  been  simply  to  move  to  a  long-word-length  machine 
This  not  only  eliminates  many  actual  problems  caused  by  inadequate  computa¬ 
tional  accuracy,  but  also  allays  fears  that  accuracy  problems  are  lurking  in 
every  program.  The  engineer  programmer  then  has  confidence  in  the  machine 
and  can  concentrate  on  algorithms  for  his  application  rather  than  on  computa¬ 
tional  error  analysis. 


3.  Within  recent  years,  computer  systems  classed  as  large-scale  minis 
or  superminis  have  become  available,  offering  much  of  the  sophistication  and 
power  of  mainframes  at  a  much  lower  cost.  These  machines  are  exemplified  by 
the  three  families  addressed  herein:  the  VAX  11,  the  PRIME  550,  and  the 
Harris  Series  500.  The  functional  capabilities  and  performance/price  ratios 
of  such  systems  make  them  very  attractive  to  many  engineers.  Some  envision 
such  machines  dedicated  to  a  relatively  small  group  of  engineers — perhaps  a 
group,  section,  or  branch — where  there  is  no  contention  for  machine  resources 
from  management  and  business  programs  or  from  other  users  outside  the  group. 
The  ability  to  manage  and  control  one's  own  machine  resources,  rather  than 
sharing  a  central  facility,  has  considerable  appeal  to  many.  Whether  used  as 
a  "private"  system  or  a  central  facility  to  provide  a  part  of  the  computa¬ 
tional  requirements  of  a  larger  Corps  organization  (e.g.,  an  Engineer  Dis¬ 
trict),  this  class  of  computer  system  holds  great  interest  within  the  Corps 
of  Engineers.  Witness  the  recent  procurement  of  Harris  500  systems  and  the 
planned  Corps  of  Engineers'  Automation  Plan  (CEAP)  for  local  processor  systems 


Objectives 

4.  The  objectives  of  this  report  are  to  present  some  fundamentals  of 
FP  arithmetic  and  some  causes  of  substantial  loss  of  accuracy,  to  examine  the 
internal  representations  for  each  of  the  three  machine  families,  to  test  the 
relative  accuracy  of  the  machine  families  by  executing  a  set  of  test  programs 
on  representative  systems,  and  to  draw  some  conclusions  regarding  the  ac¬ 
curacy  inherent  in  each  machine  family.  The  test  programs  were  chosen  to  in¬ 
clude  both  "textbook"  and  "real  world"  problems.  The  problems  consisted  of  a 
simple  arithmetic  problem,  two  numerical  analysis  problems,  and  three  S&E 
problems.  The  same  set  of  data  for  each  program  was  used  on  each  of  the  sys¬ 
tems.  The  data  for  the  S&E  programs  represented  normal  conditions  and  were 
not  "cooked-up”  to  show  word-length  problems.  The  minicomputers  used  in  this 
study  were  the  Harris  500,  VAX  11/780,  and  PRIME  550.  Runs  were  also  made  on 
an  IBM  4331,  to  permit  comparison  with  the  very  well  known  and  widely  avail¬ 
able  IBM  FP  representation,  and  on  the  CDC  CYBER  6600,  a  large-scale  system 
whose  results  were  used  as  a  base  for  comparing  the  other  results. 


5.  It  is  not  intended  to  present  details  of  each  machine  s  internal 
algroithms  for  performing  FP  operations  (i.e.,  step-by-step  register  moves, 
shifts,  adds,  etc.),  nor  to  identify  anomalies  that  may  be  present  in  each 
vendor's  implementation  of  the  operations,  since  such  information  is  not 
readily  available  and  is  not  necessary  for  drawing  conclusions  regarding  rel¬ 
ative  accuracies  for  the  three  families  in  question.  All  runs  were  also  made 
on  a  CDC  CYBER  system  to  serve  as  a  baseline.  The  systems  were  not  selected 
due  to  their  competitiveness  in  either  performance  or  price.  Rather,  they 
were  selected  simply  as  three  available  systems  with  FP  architectures  repre¬ 
sentative  of  the  three  families  of  systems. 


Efficiency  Considerations 

6.  FP  operations  are  implemented  on  most  modern  minicomputer  systems  as 
part  of  the  computer  architecture--either  through  hardware  FP  units  or  through 
microprogramming.  How  the  FP  operations  are  implemented  greatly  affects 
machine  performance  for  S&E  applications;  i.e.,  hardware  implementations  are 
generally  much  faster  than  microprogrammed  implementations.  While  only  a  very 
cursory  and  inconclusive  comparison  of  program  execution  times  is  made  herin, 
it  should  be  noted  that  a  well-planned,  rigorous  comparison  of  FP  computation 
speeds  is  an  important  factor  in  selecting  a  minicomputer  for  S&E  applica¬ 
tions.  Also  note  that  all  three  families  of  systems  have  hardware  implementa¬ 
tions  either  as  standard  equipment  or  as  options. 


PART  II:  FUNDAMENTALS  OF  FLOATING-POINT  ARITHMETIC 


Basic  Notation 

7.  A  machine's  representation  of  FP  numbers  is  the  computer  equivalent 
of  the  human  representation  used  to  express  very  large  or  very  small  numbers; 
i.e.,  the  familiar  scientific  notation.  The  number  consists  of  two  parts: 
a  signed  part  usually  called  the  fraction  or  mantissa,  which  has  an  assumed 
radix  point  which  is  fixed;  and  a  part  called  the  exponent  which  is  the  power 
of  the  radix  by  which  the  fraction  is  multiplied  to  produce  the  value  repre¬ 
sented.  The  radix  is  the  number  base  for  the  representation;  e.g.,  in  scien¬ 
tific  notation  using  a  decimal  number  system,  the  radix  is  10  and  a  number  can 
be  represented  as 


X  *  10  **  Y 

where  X  is  the  fractional  part  and  Y  the  exponent.  Rather  than  using  10  as  a 
radix,  computer  systems  generally  use  a  radix  of  either  2,  8,  or  16  to  facili¬ 
tate  execution  of  FP  operations  on  numbers  stored  as  sequences  of  binary 
digits  (bits).  Where  the  radix  is  2  (i.e.,  a  binary  FP  representation),  a 
number  is  represented  as 


X  *  2  **  Y 

and  FP  operations  can  be  implemented  efficiently.  The  three  primary  machine 
families  examined  herein  use  a  binary  representation;  i.e.,  a  radix  of  2. 

8.  A  general  form  for  the  internal  representation  of  FP  numbers  can  be 
viewed  as  follows: 


B 

exponent 

B 

fraction 

A  single  bit  is  used  for  the  algebraic  sign  of  the  fraction,  an  exponent  sign 
bit  is  sometimes  used  (see  paragraph  9  for  an  alternate  exponent  sign  repre¬ 
sentation),  and  multiple  bits  are  used  for  both  the  exponent  and  fraction 
magnitudes.  In  this  conceptual  view,  moving  from  left  to  right  in  either  the 
exponent  or  fraction  fields  is  iving  f v  most  significant  bit  to  least 
significant  bit.  Some  representaM  ns  ,.lace  the  assumed  radix  point  to  the 


left  of  the  leftmost  fraction  bit,  while  others  place  it  right  of  the  right¬ 
most  bit.  A  common  characteristic  of  representations  is  that  FP  values  are 
"normalized";  i.e.,  numbers  are  stored  with  no  insignificant  leading  zeros  in 
the  fraction  so  that  the  maximum  number  of  significant  digits  may  be  stored. 
Normalization  is  easily  accommodated  since  the  value  of  the  exponent  deter¬ 
mines  the  effective  binary  point,  just  as  the  value  of  the  exponent  determines 
the  effective  decimal  point  in  scientific  notation.  The  fraction  can  be 
readily  adjusted  to  eliminate  leading  zeros,  accompanied  by  a  corresponding 
adjustment  of  the  exponent.  Thus,  the  size  of  the  exponent  field  determines 
the  range  of  numbers  which  can  be  represented,  and  the  size  of  the  fraction 
field  determines  the  precision  of  the  representation.  FP  precision  can  be 
characterized  by  the  number  of  singificant  digits  which  can  be  represented  or 
by  the  machine  "epsilon,"  or  £  ,  which  is  defined  as  the  smallest  number  such 
that  1.  +  e  >  1. 


A  Sample  Representation 

9.  Many  different  FP  number  representations  are  possible  with  differ¬ 
ences  in  radix,  location  of  radix  point,  treatment  of  exponent  and  fraction 
signs,  size  of  exponent  and  fraction,  and  treatment  of  negative  values.  For 
purp'-  s  of  exposition,  a  sample  FP  format  will  be  presented  based  on  a  32-bit 
word  whose  bits  are  numbered  starting  with  bit  zero  at  the  left  end  as  follows 

0  1  7  8 _ 31 

|  s  |  xxxxxxx { yyyyyyyyyyyyyyyyyyyyyyyy 


where 

s  represents  the  sign  of  the  fraction  (0  is  +,  1  is  -) 
y  represents  the  exponent 
x  represents  the  fraction 

In  this  format,  the  exponent  field,  rather  than  having  an  algebraic  sign,  is 
"biased"  as  a  means  of  allowing  for  a  negative  exponent.  A  biased  exponent 
is  one  for  which  a  zero  exponent  value  is  represented  by  an  exponent  field 
containing  a  1  for  the  leftmost  bit  and  0’s  for  all  others.  For  example,  a 
zero  exponent  would  be 


8 


1 _ 7 

1000000 


The  leftmost  bit  in  the  exponent  field  is  considered  the  "bias  bit,"  and  the 
exponent  would  have  a  bias  of  1  *  2  **  6  =  64.  Therefore,  positive  exponents 
would  be  represented  by  exponent  fields  containing  values  greater  than  64, 
and  negative  exponents  would  correspond  to  exponent  field  values  less  than  64. 
Specifically,  the  exponent  value  for  the  sample  format  is  equal  to  the  content 
of  the  exponent  field  minus  64.  An  exponent  of  +5  would  be  represented  as: 


1 _ 7 

11000 101 1  (6910) 

and  an  exponent  of  -5  would  be  represented  as: 


1 _ 7 

1011 101  l~l  (5910) 

This  sample  format  has  a  binary  exponent;  i.e.,  the  radix  is  2,  or  the  ex¬ 
ponent  denotes  the  power  of  2  by  which  the  fraction  is  multiplied  to  obtain 
the  value  represented.  The  binary  point  is  implicitly  located  to  the  left  of 
the  leftmost  fraction  bit,  and  the  fraction  field  is  a  signed-magnitude  repre¬ 
sentation  (i.e.,  the  fraction  field  for  a  negative  number  is  the  same  as  that 
for  the  same  positive  number);  but  the  algebraic  sign  field  is  different.  All 
values  are  "normalized";  i.e.,  numbers  are  stored  with  no  leading  zeros  in  the 
fraction  so  that  the  maximum  number  of  significant  digits  may  be  represented. 
Note  that  a  complement  form  rather  than  signed  magnitude  is  used  for  the  frac¬ 
tion  in  some  machines  such  as  the  PRIME.  The  PRIME'S  use  of  complement  form 
for  fractions  will  be  explained  in  Part  III.  Some  examples  of  values  repre¬ 
sented  in  the  sample  FP  format  are  as  follows: 


0178  31 


0 

1000001 

100000000000000000000000 

0 

1 

7 

8 

31 

E 

1000001 

100000000000000000000000 

0 

1 

7 

8 

31 

0 

0111111 

100000000000000000000000 

i 

i 


-2510  m-l2  *  2_1  = 

525. 51q  =  .100000110112  *  210  = 

-525. 51q  =  -.100000110112  *  210 


E 

0111111 

100000000000000000000000 

0 

l 

7 

8 

31 

E 

1001010 

100000110110000000000000 

0 

1 

7 

8 

31 

E 

1001010 

100000110110000000000000 

Basic  FP  Operation  Algorithms 

Addition  and  subtraction 

10.  These  operations  are  accomplished  by  first  aligning  the  radix  point 
of  the  two  operands  to  the  same  relative  position,  performing  the  required 
operation  (addition  or  subtraction)  on  the  fractions,  then  normalizing  the 
result.  Using  the  following  notation: 

x  and  y  =  operands 
s  =  sum 

e(x),  e(y),  and  e(s)  =  exponents  of  x,  y,  and  s,  respectively 
f(x),  f(y),  and  f(s)  =  fractions  of  x,  y,  and  s,  respectively 
an  addition  algorithm  for  the  sample  representation  can  be  stated  more  spe¬ 
cifically  as  follows: 

a.  If  e(x)  f  e(y),  then  select  the  operand  with  the  smaller  ex¬ 
ponent  and  shift  its  fraction  right,  incrementing  its  exponent 
by  1  for  each  bit  shifted,  until  e(x)  =  e(y);  i.e.,  the  radix 
points  are  aligned. 

b.  Add  the  operand  fractions  to  obtain  the  sum  fraction;  i.e., 

f  (s )  =  f  (x)  +  f  (y) 

One  of  three  exception  cases  may  then  occur: 

(1)  Case  1:  f(s)  =  0.  Then  set  e(s)  to  0  (most  negative 
value);  this  forces  s  to  0. 

(2)  Case  2:  f(s)  overflows.  Then  shift  f(s)  right  1  bit 
(shifting  the  overflow  bit  into  the  most  significant  posi¬ 
tion)  and  set  e(s)  =  e(s)  +  1. 

(3)  Case  3:  After  Case  2,  e(s)  overflows.  Then  set  the 
magnitude  of  s  to  the  largest  value,  maintaining  the 
proper  sign. 

c.  Normalize  s;  i.e.,  shift  f(s)  left  until  the  most  significant 
bit  is  1,  subtracting  1  from  e(s)  for  each  bit  shifted;  if 
e(s)  underflows,  set  s  =  0. 


The  subtraction  algorithm  is  the  same  except  that  in  step  b  the  operand  frac¬ 
tions  are  subtracted  rather  than  added. 

11.  Note  that  when  the  exponents  are  different,  loss  of  significance 
sometimes  occurs  in  the  operand  with  the  smaller  exponent  since  its  fraction 
is  shifted  right  and  bits  are  lost  as  they  are  shifted  out  of  the  least 
significant  bit  position.  The  larger  the  difference  in  the  order  of  magni¬ 
tude  of  the  operands,  the  greater  the  potential  loss  of  significance  in  the 
smaller  operand.  This  problem  will  be  addressed  further  later  in  this  part. 
Multiplication  and  division 

12.  Multiplication  and  division  operations  employ  the  following  mathe¬ 
matical  relations: 


x  =  f (x)  *  2  **  (e(x)) 
y  =  f (y)  *  2  **  (e(y)) 

x  *  y  =  (f(x)  *  2  **  e(x))  *  (f(y)  *  2  **  e(y)) 

=  (f(x)  *  f (y))  *  2  **  (e(x)  +  e(y)) 

x/y  =  (f (x)  *  2  **  e(x))/(f(y)  *  2  **  e(y)) 

=  (f (x)/f (y))  *  2  **  (e(x)  -  e(y)) 

Therefore,  the  multiplication  operation,  for  example,  can  be  accomplished  by 
multiplying  the  fractions,  adding  the  exponents,  and  normalizing  the  results. 
If  p  is  the  product  and  e(p)  and  f(p)  are  the  exponent  and  fraction  of  the 
product,  respectively,  an  algorithm  for  the  sample  representation  can  be 
stated  more  specifically  as  follows: 

a.  Multiply  the  operand  fractions  to  obtain  the  double-length 
product  fraction;  i.e., 

f (p)  =  f(x)  *  f (y) 

b.  Add  the  operand  exponents  to  obtain  the  product  exponent;  i.e. 

e(p)  =  e(x)  +  e(y)  -  bias 
One  of  three  exception  cases  may  occur: 

(1)  Case  1:  e(p)  overflows.  Then  set  f(p)  to  the  largest 
magnitude  fraction  with  the  proper  sign  and  set  e(p)  to 
the  largest  positive  exponent. 

(2)  Case  2:  e(p)  underflows.  Then  set  f(p)  to  0  and  set  e(p 
to  0  (most  negative  value);  this  forces  p  to  0. 

(3)  Case  3:  f(x)  or  f(y)  =  0.  Then  set  f(p)  to  0  and  set 
e(p)  to  0;  this  forces  p  to  0. 


c.  Normalize  the  double-length  product  (normalization  may  produce 
Case  2  above). 

d.  Round  the  product  to  the  proper  word  length,  renormalizing  if 
required. 

The  division  algorithm  is  similar  and  will  not  be  stated  here. 

13.  Note  that  the  multiplication  algorithm  includes  none  of  the  frac¬ 
tion  shifting  that  produces  lost  operand  bits  as  in  addition  and  subtraction. 
One  might  intuitively  believe  from  this  that  addition  and  subtraction,  rather 
than  multiplication  and  division,  are  the  major  culprits  in  FP  accuracy  loss. 
This  conclusion  can,  in  fact,  be  shown  to  be  true  (Knuth  1969),  and  it  is 
well  established  that  substantial  losses  in  accuracy  can  be  expected  from 
addition  and  subtraction  in  some  cases,  but  not  from  multiplication  and  divi¬ 
sion.  These  losses  are  addressed  in  the  next  section. 

Errors  in  Operations 

Large  differences  in  operand  magnitudes 

14.  From  the  description  of  the  addition  algorithm  presented  above,  it 
is  apparent  that  addition  or  subtraction  performed  on  two  operands  with  large 
differences  in  magnitudes  will  result  in  loss  of  significant  digits  in  the 
operand  of  lesser  magnitude — perhaps  even  of  the  entire  operand.  Using  the 
sample  FP  representation,  the  fraction  is  represented  by  24  binary  digits 
which  is  equivalent  to  less  than  8  decimal  digits.  Hence,  for  the  addition 

40000.  +  .0025 


the  second  operand  is  totally  lost  during  the  operation  and  the  sum  produced 
is  simply  40000.  However,  our  concern  is  generally  with  the  relative  error 
(i.e.,  the  magnitude  of  the  error  relative  to  the  true  result)  which  in  this 
case  will  be  less  than  1  in  10  **  7,  and  in  general  will  be  on  the  order  of 
2  **  (-n)  where  n  is  the  number  of  bits  in  the  fraction.  Therefore,  in  some 
cases  where  a  single  operation  produces  a  final  result,  loss  of  significance 
due  to  magnitude  differences  in  operands,  such  as  shown  in  this  example,  may 
be  unimportant.  In  other  cases  where  an  expression  requires  several  opera¬ 
tions  in  sequence,  errors  due  to  greatly  varying  operand  magnitudes  can  pro¬ 
duce  unexpected  results  if  the  order  in  which  operations  are  performed  is  not 


carefully  selected.  Specifically,  the  associative  law  and  the  distributive 
law  can  fail  rather  badly.  (See  Part  IV  for  examples.) 

Small  differences  in  operands 

15.  FP  subtraction  of  very  nearly  equal  values  (or  addition  of  values 
with  nearly  equal  magnitudes  but  opposite  signs)  can  produce  very  large  rela¬ 
tive  errors.  Again,  considering  the  sample  representation  with  7+  decimal 
digits  precision,  given 

s  =  x  -  y 

where 

x  =  2.575242 
y  =  2.575231 

then 

s  =  0.000011  or  11  *  10  **  (-6) 


which  is  only  a  2  significant  digit  result. 

16.  However,  the  potential  roundoff  error  in  each  of  the  original 
operands  is  approximately  .5  *  10  meaning  that  the  subtraction  has  produced 
only  1  reliable  significant  digit  from  two  7-digit  operands.  Thus,  the  maxi¬ 
mum  relative  error  in  this  example  is  potentially  very  high;  i.e.,  approxi¬ 
mately  1/11.  Improper  expression  construction  can  unnecessarily  produce  large 
relative  errors  when  the  associative  or  distributive  law  fails  due  to  nearly 
equal  operands.  (Again,  see  Part  IV  for  examples.) 


n;v 


PART  III:  FLOATING-POINT  REPRESENTATIONS  FOR  EACH  MACHINE 

The  VAX  11  Family 

17.  The  VAX  11  FP  representation  takes  advantage  of  the  fact  that  for 
signed  magnitude  normalized  numbers  the  high-order  fraction  bit  is  always  1 
(see  Part  II),  and  thus  in  the  stored  value  this  bit  is  redundant  and  need 
not  be  kept.  This  feature  allows,  in  effect,  the  gaining  of  an  extra  bit  of 
significance  from  a  given  size  stored  fraction.  To  facilitate  use  of  this 
concept,  the  hardware  restores  this  "hidden"  bit  before  performing  arithmetic 
operations  and  likewise  removes  the  bit  before  storing  in  memory  the  results 
of  an  operation.  Thus,  there  are  single-  and  double-precision  representations 
for  FP  numbers  in  memory  which  are  not  identical  with  the  corresponding  repre¬ 
sentations  in  the  arithmetic  unit.  Representations  are  shown  below  in  a  con¬ 
ceptual  sense;  i.e.,  fields  constituting  the  values  are  depicted  without  show¬ 
ing  the  VAX  numbering  scheme  for  bits  or  byte  addresses: 
a.  Single-precision: 

(1)  Value  in  memory: 


exponent 


8  bits  23  bits 


fraction 


(2)  Value  as  processed  by  arithmetic  unit: 


exponent 


8  bits  24  bits 


fraction 


b.  Double-precision: 

(1)  Value  in  memory: 


exponent 


8  bits  55  bits 


fraction 


(2)  Value  as  processed  by  arithmetic  unit: 


c.  As  previously  stated,  the  most  significant  fraction  bit  is  not 
present  in  values  in  memory,  but  is  restored  by  the  hardware 
before  arithmetic  operations  are  performed. 

d.  The  exponent  has  a  bias  of  128^;  i.e.,  200g  or  lOOOOOOO^. 

In  the  VAX's  implementation  of  algorithms  for  FP  operations,  the  arithmetic 
unit  uses  an  "overflow"  bit  on  the  left  and  two  "guard"  bits  on  the  right  to 
ensure  the  return  of  a  rounded  result  identical  with  the  corresponding 
infinite-precision  operation  rounded  to  the  specified  fraction  length;  i.e., 
to  ensure  correct  rounding.  Thus,  the  rounded  result  of  a  FP  operation  has 
a  roundoff  error  bound  of  half  of  the  least  significant  bit  of  the  fraction. 
Note  that  the  24-bit  single-precision  fraction  is  equivalent  to  approximately 
7  decimal  digits,  that  the  56-bit  double-precision  fraction  is  equivalent  to 
approximately  16  decimal  digits,  and  that  the  range  for  each  is  approximately 
.29  *  10“38  through  1.7  *  1038. 


The  PRIME  550  Series  Family 


19.  The  PRIME  550  Series  uses  memory  representations  of  32  bits  for 
single-precision  and  64  bits  for  double-precision  as  shown  below: 

a.  Single-precision: 


K 


fraction 


I 


exponent 


23  bits 


8  bits 


b.  Double-precision: 


fraction 


exponent 


47  bits 


16  bits 


20. 


Note  that: 

a.  The  sign  bit  applies  to  the  fraction  only. 

b.  The  fraction  uses  a  two's  complement  representation  for  nega¬ 
tive  numbers.  In  this  case,  "normalizing"  the  result  of  an 
operation  means  shifting  the  fraction  left  until  the  most 
significant  bit  differs  from  the  sign  bit,  and  the  exponent  is 
decreased  by  one  for  each  bit  shift. 

c.  There  is  no  "hidden"  or  "understood"  most  significant  fraction 
bit  as  in  the  VAX. 


d.  The  exponent  has  a  bias  of  128^  (i.e., 


200g  or  100000002)  for 


both  single-  and  double-precision  values  in  memory. 


The  FP  register  used  by  the  arithmetic  unit  for  single-precision  operations 
has  a  16-bit  exponent  and  31-bit  fraction.  Thus,  the  arithmetic  unit  can  use 
1  nger  fractions  and  larger  exponents  internally  while  executing  the  FP 
operation  algorithms,  and  representation  conversions  must  take  place  auto¬ 
matically  as  FP  values  move  between  the  arithmetic  unit  and  memory.  While 
the  added  length  of  the  arithmetic  unit  fraction  certainly  should  allow  cor¬ 
rect  rounding  to  23  bits,  roundoff  does  not  take  place  automatically  when  a 
floating  point  value  is  stored  from  register  to  memory.  Rather  the  extra 
8  bits  of  fraction  are  truncated  when  a  floating  store  is  executed.  In  this 
case,  the  error  bound  is  equal  to  the  value  of  the  least  significant  bit  of 
the  fraction.  With  double-precision  operations,  the  FP  register  representa¬ 
tion  is  the  same  as  the  memory  representation,  and  the  fraction  is  truncated 
at  47  bits. 

21.  One  might  question  the  utility  of  a  16-bit  exponent  in  the  FP 
register  while  only  8  bits  are  used  for  a  stored  value.  The  larger  exponent 
allows  the  generation  of  larger  magnitude  values  within  the  arithmetic  unit, 
but  an  attempt  to  store  such  a  value  produces  an  exception  condition,  the  pro¬ 
cessing  of  which  could  provide  for  retrieval  of  the  excess  magnitude  value 
using  special  instructions.  It  is  not  known  whether  PRIME  FORTRAN  provides 
any  facility  for  using  single-precision  values  with  16-bit  exponents,  but 
there  would  seem  to  be  little  use  for  such  a  facility.  The  standard  precision 
and  range  of  values  for  the  PRIME  are  approximately  the  same  as  those  for  the 
VAX  with  the  exception  that  the  47-bit  double-precision  fraction  is  equivalent 
to  approximately  14  decimal  digits  as  opposed  to  16  for  the  VAX. 

Harris  Series  500  Systems 

22.  The  Harris  Series  500  Reference  Manual  (Harris  Corporation  1978) 
specifies  single-,  double-,  and  quadruple-precision  FP  data  formats  as 
follows: 


b .  Double-precision  (and  single-precision  for  SAU  machines): 


fraction  (upper  23  bits) 


Word  1 


^fraction  q  exponent 

^  (lower  15  bits)  (7  bits) 


(1  bit  Word  2 
unused) 


c .  Quadruple-precision : 


(1  bit  Word  3  (1  bit  Word  4 

unused)  unused) 

23.  Note  that: 

a.  Both  the  exponent  and  the  fraction  are  signed. 

b.  Both  the  exponent  and  the  fraction  use  a  two's  complement 
representation  for  negative  numbers. 

c.  There  is  no  "hidden"  most  significant  fraction  bit  as  in  the 
VAX. 

d.  In  each  representation,  a  single  "unused"  bit  in  the  high-order 
position  of  each  fraction  word  after  the  first,  although  not 
used  in  the  FP  representation,  is  reserved  for  specific  usage 
during  execution  of  some  instructions. 

e.  The  results  of  all  operations  are  truncated,  not  rounded. 

24.  Upon  examination  of  the  data  formats,  one  might  immediately  ques¬ 
tion  the  value  of  the  single-precision  format  since  it  uses  the  same  amount  of 
memory  while  providing  significantly  less  precision.  In  fact,  Harris  documen¬ 
tation  clearly  indicates  that  for  the  Series  500  all  FP  operations  performed 
by  the  Scientific  Arithmetic  Unit  (SAU)  (i.e.,  its  FP  hardware)  are  executed 
in  the  double-precision  FP  format.  Thus,  use  of  single-precision  numbers 
would  seem  to  be  of  value  only  in  machines  without  the  SAU  option;  i.e.,  those 
in  which  FP  operations  are  constructed  from  sequences  of  integer  operations 
("software"  FP) .  In  this  case,  software  implementations  of  single-precision 
operations  could  definitely  execute  significantly  faster  than  those  for  double 
precision  operations.  While  it  seems  clear  that  use  of  single-precision 
values  is  prudent  only  for  non-SAU  equipped  machines,  the  Harris  FORTRAN  Man¬ 
ual  (Harris  Corporation  1981)  leaves  one  in  doubt  as  to  how  FP  data  types  an1 
implemented  for  SAU-equipped  machines.  Specifically,  the  Harris  FORTRAN 


Manual  shows  the  REAL  data  type  to  be  the  standard  single-precision  represents 
tion  shown  above  unless  the  compile  time  option  "P"  is  used  which  changes  the 
REAL  data  type  to  the  double-precision  representation  shown.  But  in  fact, 
execution  of  test  programs  (detailed  later)  shows  that,  for  SAU-equipped 
machines,  the  stated  double-precision  representation  is  actually  used  in¬ 
ternally  (both  in  memory  and  within  the  SAU)  for  all  REAL  values,  regardless 
of  the  use  of  the  "P"  option.  Therefore,  the  accuracy  of  computations  is 
identical  for  single-precision  and  double-precision  values.  The  only  detected 
difference  in  the  treatment  of  the  two  is  in  the  input/output  field  lengths 
provided.  Values  implicitly  or  explicitly  declared  to  be  single-precision 
have  input  and  output  field  lengths  truncated  to  the  specified  single¬ 
precision  length;  e.g.,  the  formatted  output  of  a  single-precision  value  is 
truncated  to  7  significant  decimal  digits.  Given  that  there  is  no  signifi¬ 
cant  advantage  to  using  so-called  "single-precision"  values,  it  is  probably 
wise  to  use  the  "P"  compiler  option  as  standard  practice  unless  there  is 
specific  justification  for  not  doing  so.  (Note  that  all  existing  Corps  of 
Engineers'  Harris  500  systems  have  the  SAU  option.) 

25.  Harris  also  provides  a  quadruple-precision  data  type,  as  shown  in 
paragraph  22,  apparently  implemented  via  software  FP  operations.  Such  soft¬ 
ware  implementations  of  extended-precision  data  types  are  generally  very  slow, 
a  fact  which  execution  of  the  test  programs  confirmed,  and  are  not  candidates 
for  widespread  usage  in  Corps  S&E  applications.  However,  they  can  be  valu¬ 
able  in  conducting  application  accuracy  studies  and  sometimes  in  production 
programs  when  extended-precision  data  items  are  carefully  selected. 

26.  In  summary,  the  Harris  500  Series,  with  optional  SAU,  provides  a 
data  type  used  for  both  single-  and  double-precision  values  which  has  a  38- 
bit  fraction  providing  approximately  11  decimal  digits,  and  a  software  imple¬ 
mented  quadruple-precision  data  type  providing  approximately  20  decimal  digits 

IBM  4331 

27.  The  IBM  4331  FP  representations,  shown  below,  are  used  widely 
throughout  the  IBM  product  line.  The  FP  accuracy  for  IBM  systems  is  not  the 
primary  subject  of  this  study  and  no  explanation  of  the  representation  will 
be  presented. 


Single-precision: 


S  exponent  fraction 
TTl  T8  31 


b .  Double-precision: 


S I  exponent  I  fraction 


0  1 


7  8 


31 


fraction  | 
32  63 


CDC  CYBER  6600 


28.  The  CDC  CYBER  6600  is  a  large-scale  computer  system  with  a  memory 
word  size  of  60  bits.  The  FP  representation  for  single-precision  format  is 
shown  below.  This  system  has  been  the  subject  of  several  accuracy  studies 
(Ward  1976,  O’Neil  and  Peterson  1976).  The  results  of  these  studies  have 
shown  the  CDC  6600  to  be  a  highly  accurate  system  for  the  Corps'  SSE  problems 
(For  this  study,  all  test  programs  run  on  the  CDC  system  used  only  single¬ 
precision  FP  values.) 


0  47  48  58  59 


fraction 


exponent 


S 


Further  FP  Architecture  Characteristics 

29.  Several  machine  accuracy  characteristics  which  may  be  of  interest 
(e.g.,  machine  epsilon,  largest  and  smallest  FP  numbers,  etc.)  can  be  deter¬ 
mined  from  actual  machine  execution  of  FP  operations.  In  Appendix  B  of  Cody 
and  Waite  (1980)  is  a  subroutine  called  MACHAR  for  determining  13  machine  con 
stants  which  allows  one  to  check  manufacturer's  claims.  Results  of  execution 
of  this  subroutine  on  each  of  the  five  test  systems  are  presented  in 
Appendix  A. 


PART  IV:  EXAMPLES  OF  FLOATING-POINT  ERROR 


30.  Since  FP  values  and  the  operations  on  such  are  approximations  of 
real  values  and  the  corresponding  exact  operations,  the  fundamental  laws  re¬ 
lated  to  operations  on  real  numbers  sometimes  break  down  producing  substantial 
and  even  dramatic  errors.  Examples  of  two  such  cases  and  of  the  effects  of 
fraction  length  and  treatment  of  rounding  are  given  below. 

Failure  of  the  Associative  Law  of  Addition 

31.  It  is  quite  common  practice  in  programming  to  apply  the  associative 
law  of  addition: 


(x  +  y)  +  z  =  x  +  (y  +  z) 

That  is,  in  performing  a  sequence  of  additions,  to  produce  a  single  sum,  the 
order  in  which  the  additions  are  performed  is  usually  not  considered  a  matter 
for  concern;  i.e.,  associativity  is  assumed.  This  is  not  surprising  since  com 
mon  mathematical  notations  for  summation  are  inherently  based  on  associativity 
However,  failure  of  the  associative  law  can  occur  in  FP  operations;  i.e., 

(x  +  y)  +  z  t  x  +  (y  +  z) 


in  some  cases. 

32.  For  example,  consider  a  hypothetical  FP  representation  with  exactly 
7  decimal  digits;  i.e.,  a  true  decimal  representation.  Then,  the  expression 

(5505026.  +  (-5505024.))  +  3.9375 
=  2.  +  3.9375 
=  5.9375 

which  is  the  exact  result. 

33.  However,  if  we  change  the  order  of  evaluation  such  that  the  ex¬ 
pression  becomes 

5505026.  +  (-5505024.  +  3.9375) 

=  5505026.  +  (-5505021.) 

=  5.0 

if  truncation  rather  than  rounding  takes  place,  or 


20 


Tv 


=  5505026.  +  (-5505020.) 

=  6.0 

if  rounding  takes  place,  neither  result  is  exact.  The  associative  law  has 
failed  in  both  cases,  but  where  rounding  was  not  used,  the  failure  is  much 
more  pronounced. 

34.  Now  consider  a  binary  FP  representation  with  a  23-bit  fraction 
(such  as  those  for  systems  examined  herein)  and  the  summation  expression 
above.  The  values  in  the  expression  have  fractions  and  exponents  as  follows: 


fraction 

(5505026.)  | 10101000000000000000010 
(5505024.)  | 10101000000000000000000 
(3.9375)  | 1 1 1 1 1000000000000000000 | 


exP  =  23io 

exp  =  231q 
exp  =  2 


As  above,  the  first  expression  for  the  sum  of  the  three  values  will  produce 
the  correct  result  and  the  second  will  not.  The  subexpression  (-5505024. 

+  3.9375)  again  is  of  interest  as  the  source  of  error.  The  addition  algorithm 
(see  Part  II)  shifts  the  fraction  for  3.9375  some  21  bit  positions  right  to 
align  the  binary  point,  resulting  in  the  following: 


(3.9375) 


fraction 

1 000000000000000000000 1 1 


exp  =  23. 


35.  In  effect,  3.9375  has  become  3.0  and  the  addition  will  produce 
-5505021.,  if  truncation  rather  than  rounding  takes  place.  However,  if  addi¬ 
tional  fraction  bits  are  employed  within  the  arithmetic  unit  to  achieve  cor¬ 
rect  rounding,  the  result  of  the  subexpression  is  -5505020.,  producing  a  com¬ 
plete  expression  result  of  6.0.  Thus,  in  this  case,  the  result  of  truncation 
and  rounding  in  the  FP  binary  representation  is  much  the  same  as  we  would  ex¬ 
pect  if  we  examined  a  true  decimal  representation  with  the  equivalent  number 
of  decimal  digits.  Such  is  not  always  the  case,  as  illustrated  by  the  next 
example  of  failure. 

36.  In  the  example  above,  single-precision  truncation  of  results  in 
the  PRIME  would  produce  an  expression  result  of  5.0  rather  than  6.0.  The  VAX, 
with  the  extra  "hidden"  bit  and  the  facility  for  correct  rounding,  would  pro¬ 
duce  6.0.  The  Harris  single-precision  or  use  of  double-precision  in  any  of 
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the  three  machines  would  produce  correct  results;  i.e.,  5.9375.  While  this 
example  was  specifically  chosen  to  illustrate  a  case  where  a  23-bit  fraction 
is  inadequate  to  support  associativity,  the  principle  holds  for  any  size 
fraction. 

Failure  of  the  Distributive  Law 

37.  Similarly  the  distributive  law 

x  *  (y  +  z)  =  (x  *  y)  +  (x  *  z) 

is  commonly  assumed  to  be  valid  for  FP  operations,  but  in  fact  can  fail  rather 
badly  in  some  cases.  Consider  again  the  hypothetical  FP  representation  with 
exactly  7  decimal  digits.  Given  that  x  =  20000.,  y  =  -6.0,  and  z  =  6.000003, 

x  *  (y  +  z)  =  20000.  *  (.000003) 

=  .06 

which  is  the  correct  result. 

38.  However,  if  we  attempt  to  apply  the  distributive  law, 

(x  *  y)  +  (x  *  z)  =  (20000.  *  -6.)  +  (20000.  *  6.000003) 

=  -120000.  *  120000.1 

=  .1 

if  correct  rounding  takes  place,  or 

=  -120000.  +  120000.0 
=  0. 

if  truncation  takes  place.  Thus,  the  failure  using  a  true  decimal  representa¬ 
tion  in  this  case  could  be  disastrous;  e.g.,  if  truncation  takes  place  and  the 
result  is  used  as  a  divisor. 

39.  However,  true  decimal  representations  are  rarely  used  for  FP 
arithmetic  (it  is  shown  here  only  for  illustrative  purposes).  Therefore,  let 
us  consider  this  example  using  a  binary  representation  with  a  23-bit  fraction, 
as  in  the  previous  example.  It  can  be  shown  (the  multiplication  is  lengthy 
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and  will  be  omitted)  that  for  such  a  representation,  the  distributed  expres¬ 
sion  above  produces  results  of 

.0625 

if  correct  rounding  is  achieved,  or 

.0469 

if  truncation  is  used.  Again,  neither  result  is  the  true  value  which  is  pro¬ 
duced  by  the  first  expression.  Thus,  the  distributive  law  has  failed;  i.e., 

x  *  (y  +  z)  t  (x  *  y)  +  (x  *  z) 

While  the  magnitude  of  the  error  is  small,  it  must  be  remembered  that  it  is 
the  relative  error  that  is  significant,  and  the  relative  error  here  is  large. 
It  is  also  significant  to  note  the  large  increase  in  the  relative  error  if 
rounding  is  not  used;  i.e.,  for  rounding 


i  (.0625  -  .06)  .0025 

relative  error  =  - - tcs—  = 

. Ob  .06 


=  4+% 


for  truncation, 


,  (.06  -  .0469)  .0131 

relative  error  =  ^ ^ ^ 


=  2l\ 


Such  errors  which  are  small  in  an  absolute  sense  but  large  in  a  relative  sense 
can  easily  produce  large  magnitude  errors  when  the  results  are  used  in  further 
computations;  e.g.,  if  the  result  above  is  used  as  a  divisor  with  a  large 
dividend,  the  absolute  error  produced  could  be  very  large. 

40.  Again  this  example  was  chosen  to  illustrate  a  severe  case,  but  such 
cases  could  easily  be  produced  in  real  programs. 


PART  V:  TEST  RESULTS 


41.  Two  sets  of  test  problems  were  run  on  the  computers  selected  for 
this  study.  The  problems  were  chosen  carefully  to  ensure  that  they  would  pro¬ 
vide  information  on  accuracy  of  the  computers.  The  first  set  is  a  group  of 
mathematical  problems.  The  problems  strain  each  system  to  a  considerable 
extent  in  its  ability  to  handle  computations.  A  simple  example  of  adding  a 
series  of  numbers  forward  and  backward  is  followed  by  an  evaluation  of  a  poly¬ 
nomial  function  and  the  inversion  of  a  simple  but  intriguing  matrix.  Complete 
listings  of  the  mathematical  programs  used  are  included  in  Appendix  B. 

42.  The  second  set  of  examples  was  chosen  from  Corps  of  Engineers  files 
This  set  represents  real  world  conditions,  and  the  problems  and  data  chosen 
were  what  one  could  reasonably  be  expected  to  encounter  in  design  and/or  anal¬ 
ysis  applications.  The  first  problem  is  an  analysis  of  a  flexible  sheet  pile 
bulkhead  using  the  finite  difference  approach.  The  second  problem  uses  the 
finite  element  technique  for  analysis  of  a  soil-structure  interaction  problem 
of  a  sheet  pile  embedded  in  soft  clay.  The  last  problem  involves  computing 
the  buckling  load  of  a  pile  using  again  the  finite  difference  procedure. 

These  problems  were  selected  to  show  that  engineers  must  be  careful  in  using 
computer  programs  in  machines  that  do  not  carry  a  sufficient  number  of  signif¬ 
icant  digits  in  computations. 


Mathematical  Problems 

Addition  of  a  series  of  integers — program  MAXJ 

43.  The  first  test  program  is  very  short  and  simple  but  demonstrates 
how  accuracy  can  be  lost  in  simple  arithmetic  operations. 

44.  One  of  the  basic  arithmetic  laws,  the  commutative  law  of  addition, 
says  that 

a  +  b  =  b  +  a 

i.e.,  in  adding  two  numbers,  it  does  not  matter  which  is  placed  first.  This 
law  holds  rigorously  for  arithmetic  in  which  all  numbers  are  exact,  even  when 
extended  to  apply  to  a  sequence  of  an  arbitrary  number  of  additions.  However, 
for  a  sequence  of  an  arbitrary  number  of  FP  additions,  the  law  does  not 
necessarily  hold. 
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45.  Program  MAXJ,  which  is  given  a  maximum  value  in  J,  will  calculate 
a  forward  and  backward  sum  of  the  positive  integers  from  1  through  J.  The 
program  first  computes  the  forward  sum  "SL"  in  which 

SL  =  1.  +  2.  +  3.  +  ...  ♦  Float(J-l)  +  Float(J) 
and  the  backward  sum  "SU"  in  which 

SU  =  Float(J)  +  Float(J-l)  +  Float(J-2)  +  ...  +2.  +1. 

MAXJ  then  calculates  the  difference  in  the  two  sums  as  "DIFF." 

46.  All  computers  have  a  maximum  FP  value,  determined  primarily  by 
their  maximum  exponent  value.  Thus,  in  computing  the  sum  of  a  sequence  of 
integers  from  1  to  J,  any  system's  FP  value  will  overflow  at  some  value  of  J. 
Computers  that  have  larger  exponent  fields  can  be  expected  to  overflow  at  a 
larger  value  of  J  than  those  with  a  small  exponent  field.  If  FP  number  repre¬ 
sentations  were  exact  for  all  sums  less  than  a  machine's  maximum  FP  value, 
then  it  would  be  possible  to  calculate  DIFF  in  the  forward  and  backward 

sums,  as  shown  above,  and  DIFF  would  always  be  zero  so  long  as  no  FP  over¬ 
flow  occurred.  However,  FP  representations  are  approximate  not  only  for 
fractional  values  but  also  for  integers  whose  magnitudes  exceed  the  number  of 
digits  which  can  be  represented  exactly.  Such  approximations  occur  frequently 
since  it  is  quite  common  for  FP  representations  to  accommodate  magnitudes  of 
10  **  38  but  only  store  the  equivalent  of  approximately  7  significant  decimal 
digits;  e.g.,  in  the  VAX  and  PRIME  representations.  Where  sums  exceed  the 
maximum  value  which  can  be  represented  exactly,  forward  summing  will  produce 
roundoff  errors  different  from  those  produced  by  backward  summing,  and  thus 
nonzero  values  of  DIFF  will  occur  long  before  the  occurrence  of  a  FP  value 
overflow.  The  results  of  program  MAXJ,  summarized  in  Table  1,  illustrate  this 
condition.  Program  runs  were  made  on  each  system  using  each  available  pre¬ 
cision  with  J  =  10^,  10^,  10^,  10^,  and  10^  in  that  order.  When  a  given  value 
of  J  produced  a  nonzero  value  of  DIFF,  no  further  runs  were  made  on  that  sys¬ 
tem  at  that  precision.  Only  single-precision  was  tested  on  the  CDC.  The 

Harris  and  the  CDC  were  the  only  systems  in  which  DIFF  was  0  with  J  as  large 

5  3 

as  10  in  single-precision.  The  other  systems  failed  beyond  J  =  10  .  In 

double-precision,  the  Harris  failed  beyond  10^,  as  in  single-precision,  but 
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Table  1 

Summary  of  MAXJ  Runs 


Computed  Results  ior  Cited  J 


System 

Precision 

103 

104 

105 

106 

107 

Single 

0 

0 

0 

99999. 

NR* 

Harris 

Single 

0 

0 

20630. 

NR 

Double 

0 

0 

20630. 

NR 

Quadruple 

0 

0 

0 

0 

IBM 

Single 

0 

-29552. 

NR 

NR 

NR 

Double 

0 

0 

0 

PRIME 

Single 

0 

-13408. 

NR 

NR 

NR 

Double 

0 

0 

0 

VAX 

Single 

0 

1972. 

NR 

NR 

NR 

Double 

0 

0 

0 

*  Not  run . 


all  the  other  minisystems  were  able  to  compute  up  to  107. 

Evaluation  of  a  polynomial  function — program  POLY 

47.  This  example  was  run  to  obtain  some  "overall"  comparisons  between 
the  systems.  First,  however,  the  concept  of  "noise  level"  will  be  explained. 

48.  Let  us  suppose  that  a  continuous  function  defined  by  y  =  f(x)  needs 
to  be  plotted.  In  a  computer,  to  plot  y  versus  x,  y  is  evaluated  for  various 
values  of  x.  Since  the  function  is  continuous,  one  would  expect  to  obtain  a 
nice  smooth  continuous  curve  for  y.  However,  if  the  function  is  complicated, 
due  to  rounding  errors,  a  scatter  of  values  could  result,  as  illustrated  be¬ 
low  in  an  exaggerated  manner. 


The  computed  function  would  lie  in  the  band  indicated  by  the  dotted  lines. 
One  could  also  produce  a  "mean  curve"  by  computing  not  simply  a  single  value 


of  y  for  each  selected  value  of  x,  but  the  mean  of  values  of  y  computed  at  x 
and  at  several  points  on  either  side  of  x.  If  the  arithmetic  were  exact 
(i.e.,  infinitely  precise)  and  the  points  on  either  side  of  x  were  chosen  to 
be  very  close  to  x,  then  the  computed  mean  curve  would  coincide  very  nearly 
to  the  exact  function  curve.  However,  given  the  limited  precision  of  FP 
representations,  this  mean  curve  will  not  necessarily  coincide  exactly  with 
the  function  y  =  f(x);  i.e.,  there  will  be  a  "bias."  The  computed  points  will 
be  scattered  around  the  mean  curve.  A  convenient  measure  of  this  scatter  has 
been  chosen  to  be  twice  the  standard  deviation  from  the  mean  curve  and  has 
been  designated  as  the  "noise  level"  (Noble  1982). 

49.  We  will  now  evaluate  a  particular  example  using  this  concept.  The 
polynomial  equation  whose  roots  are  the  first  20  integers  is  given  below: 


P9n(x)  =  (x  -  l)(x  -  2)  ...  (x  -  20) 

20  19  ,  18 

-  x  +  ajx  +  a2x  +  . . .  +  a2Q 


This  function  has  roots  that  are  very  sensitive  to  small  variations  in  the 
coefficients  of  the  higher  powers  of  x;  This  example  may  be  too  ill  condi¬ 
tioned  for  the  smaller  systems,  so  we  will  use  a  program  called  POLY  which 
uses  a  polynomial  whose  roots  are  the  first  11  integers: 


P  .(x)  =  (x  -  1 ) (x  -  2)  ...  (x  -  11) 

11  ,  10  .  9 

=  x  +  ajX  +  a£x  +  . . .  +  an 

Figure  1  shows  a  plot  of  this  function  when  P^(x)  is  evaluated  for  various 
values  of  x  (taken  from  Noble  1982). 

50.  The  evaluation  of  P^x)  at  the  smaller  x  values  shows  small  devia¬ 
tions  (noise  level),  but  as  the  x  values  increase  we  can  see  an  increase  in 
the  noise  level. 

51.  Program  POLY  first  computes  the  coefficients  of  the  polynomial, 
a^,  i  =  1  ...  11.  (Computation  of  the  a^  values  on  each  of  the  systems  pro¬ 
duced  identical  results  in  all  precisions.)  It  then  evaluates  mean  and 
standard  deviation  values  of  P^(x)  for  selected  values  of  x;  i.e.,  it  in  ef¬ 
fect  produces  the  mean  curve  and  standard  deviations  from  the  mean  at  selected 
points  on  the  curve.  Mean  and  standard  deviation  values  of  P^(x)  are  com¬ 
puted  from  a  set  of  values  obtained  by  computing,  by  nested  multiplication, 
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VERY  SMALL 
NOISE  LEVEL 


LARGE 


Figure  1.  Plot  of  a  polynomial  function  of  degree  11 

Pjj(z)  for  z  =  x  +  ie  for  i  =  -m,  -m+1,  ...  m-1,  m  (where  x,  m,  and  e  are  sup 
plied  by  the  user);  i.e.,  POLY  computes  2m  +  1  values  of  P^(z)  for  each 
selected  value  of  x  and  then  takes  the  mean  and  standard  deviation  of  these 
2m  +  1  values. 

52.  The  theoretical  mean  of  values  around  a  root  of  a  function  should 
be  close  to  zero  if  m  and  e  values  are  small  (since  P^(x)  =  0  at  the  roots). 
Computed  values  of  the  mean  and  standard  deviation  at  roots  can  be  used  as 
metrics  of  the  relative  accuracy  performance  of  the  various  systems. 

53.  The  same  set  of  data  was  supplied  to  each  system.  Parameters  used 
were  m  =  2  and  £  =  l.E-7.  The  mean  and  standard  deviation  of  P^(x)  were 
computed  at  each  root  of  the  polynomial;  i.e.,  at  x  =  1,  2,  ...  ,  10,  11. 
Tables  2-5  present  the  results  from  the  runs  on  all  of  the  systems. 

54.  It  is  obvious  from  the  results  that  different  systems  compute  dif¬ 
ferent  mean  and  standard  deviation  values.  This  effect  is  primarily  due  to 
differences  in  FP  precision.  When  we  examine  the  standard  deviation,  we  can 


Table  2 


POLY- Computed  Means  of  P...(x)  in  Single-Precision 


Value  less  than  ±0.05. 


Table  3 

PQLY-Computed  Means  of  Pjj(x)  in  Double-  and  Quadruple-Precision 


Mean  Computed  by  Cited  System  in  Cited  Precision 


Harris 

in 

IBM  in 

PRIME  in 

VAX  in 

X 

Double 

Double 

Double 

Double 

1.0 

Jt 

* 

* 

* 

-L 

A 

2.0 

“rC 

* 

* 

JL 

#> 

JL 

A 

3.0 

A 

A 

* 

* 

* 

* 

4.0 

-.1 

* 

* 

JL 

A 

* 

5.0 

-.2 

* 

* 

J- 

A 

JL 

6.0 

-.4 

* 

* 

* 

JL 

7.0 

-1.1 

JL 

JL 

-L 

A 

* 

8.0 

-1.7 

JL 

* 

JL 

JL 

A 

9.0 

-6.4 

* 

* 

JL 

JL 

10.0 

-16.1 

* 

* 

J. 

A 

* 

$3 


Standard  Deviation  Computed  by  Cited  System 


Harris  IBM  PRIME 


in  Double-  and  Quadruple-Precision 


Standard  Deviation  Computed  by  Cited  System  in  Cited  Precision 


Harris  in 


Double 


IBM  in 
Double 


PRIME  in 
Double 


VAX  in 
Double 


1.18 

1.11 

.08 

.Of 

.19 

.If 

.01 

.01 

.22 

.0^ 

.07 

.o: 

.19 

.0^ 

3.90 

.01 

4.0 

.01 

8.83 

.81 

21.27 

8.11 

04 

.0: 

07 

.0 

07 

.0 

81 

.61 

1 

8.4 

note  that  the  noise  level  of  the  Harris  in  single-precision  increases  for  the 
larger  values  of  x,  but  much  less  than  the  increase  in  noise  levels  of  the 
IBM,  PRIME,  and  VAX  single-precision  runs.  The  noise  levels  of  the  IBM,  PRIME 
and  VAX  in  double-precision  are  smaller  than  that  of  the  Harris  in  double¬ 
precision  runs,  and  very  close  to  those  of  the  CDC  and  Harris  quadruple- 
precision  runs.  As  expected,  the  results  of  POLY  show  the  Harris  to  be  more 
accurate  in  single-precision  than  the  IBM,  VAX,  or  PRIME  but  less  accurate 
than  these  systems  in  double-precision. 

55.  It  should  be  noted  that  this  is  a  severe  test  of  accuracy;  i.e., 
the  "noise"  produced  for  the  higher  values  of  x  is  extremely  sensitive  to  dif¬ 
ferences  in  FP  precision.  Any  machine  can  be  made  to  perform  badly  in  this 
test  by  choosing  polynomials  of  higher  and  higher  order.  However,  the  order 
of  the  polynomial  and  the  values  of  m  and  e  were  chosen  with  the  objective 

of  producing  a  reasonable  test  of  the  relative  accuracy  of  the  machines  in 
question. 

Inversion  of  a  matrix — program  MATRIX 

56.  The  last  "textbook"  test  problem  involves  a  numerical  analysis  pro¬ 
gram  that  will  invert  the  Hilbert  matrix  using  Gaussian  elimination.  This 
problem  has  been  used  in  many  studies  on  accuracy  because: 

a.  Gaussian  elimination  is  a  straightforward,  commonly  used  pro¬ 
cedure  for  inverting  matrices. 

b.  The  Hilbert  matrix  is  numerically  unstable  due  to  the  "close¬ 
ness"  of  the  numbers. 

c.  This  matrix  has  a  known  inverse  against  which  we  can  compare 
the  results  (Ward  1976). 

57.  The  Hilbert  matrix  is  an  n  x  n  matrix  of  the  following  general  form 

H.  .  =  T-— 4 - r 

ij  i  +  J  "  1 

For  example,  if  n  =  2,  then 


Or  if  n  =  3,  then 


1 


1 

2 


1 

3 


H3*3 


111 
2  3  4 


1  1  1 

[3  5  5  J 

58.  The  inverse  of  a  matrix  A  is  defined  as  a  matrix  A  *  such  that 


I 


where  I  is  the  identity  matrix  of  l's  on  the  main  diagonal  and  0's  elsewhere. 
For  example,  if  n  =  2,  then 


-6 

12 


59.  Ward  (1976)  gives  a  general  algorithm  of  the  Gaussian  elimination 


method.  The  algorithm  is  as  follows: 


a. 


b. 


c. 

d. 


e. 


f. 


i  «-  1. 

p  *■  H.  .  and  H.  .  «-  1 . 
r  li  n 

Divide  row  i  by  p. 

«-  HLj  *■  1  for  all  j  such  that  j  f  i  and  1  <  j  <  n. 

Multiply  row  i  by  and  subtract  this  product  from  row  j  for 
all  j  such  that  j  f  i  and  1  <  i  <  n. 
i  <-  i  +  1. 


g.  If  i  <  n,  then  go  to  step  b.  Otherwise,  stop  here.  H  now  con¬ 
tains  the  inverse  of  the  n  x  n  Hilbert  matrix. 


:-h 


a 

J 


60.  Program  MATRIX  was  first  executed  on  each  system  to  compute  the 
inverse  of  a  2  x  2  Hilbert  matrix  to  verify  the  validity  of  the  program. 

(Note:  the  input  values  used  in  the  program  are  not  the  exact  values  of  the 
imput  matrices  because  the  computers  rounded  or  truncated  the  values  to  their 
greatest  number  of  significant  digits.)  The  computed  inverse  matrix  was  sub¬ 
tracted  from  the  known  exact  inverse  matrix  (Figure  2)  to  obtain  an  approximate 
absolute  error  matrix.  The  error  matrices  for  all  the  systems  were  very 
small.  Next,  a  6  x  6  Hilbert  matrix  (see  Figure  3)  was  run  on  each  of  the 
systems.  The  error  matrices  for  each  system  are  shown  in  Figures  4-8  (the 
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Figure  2.  Exact  analytical  solution  of  a  6  x  6  Hilbert  matrix 
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Figure  3.  6x6  Hilbert  matrix 


Figure  4.  CDC  approximate  absolute  error  matrix 
(*  indicates  value  less  than  1.0  E-6) 
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Figure  5.  Harris  approximate  absolute  error  matrices 
(*  indicates  value  less  than  0.05) 
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Figure  7.  PRIME  approximate  absolute  error  matrices 
(*  indicates  value  less  than  0.05) 
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Figure  8.  VAX  approximate  absolute  error  matrices 
(*  indicates  value  less  than  0.05) 
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results  have  been  rounded  to  1  decimal  place). 

61.  The  CDC,  the  baseline  system,  has  a  very  small  absolute  error 
matrix.  The  Harris  is  the  only  system  with  a  small  error  matrix  in  single¬ 
precision.  The  double-precision  approximate  error  matrices  for  the  PRIME  and 
VAX  were  smaller  than  that  for  the  Harris  in  quadruple-precision.  The  IBM 
had  large  error  matrices  in  both  single-  and  double-precision. 

Real  World  Problems 

62.  For  the  S&E  problems  in  this  study,  three  computer  programs  were 
selected  that  have  been  in  use  for  several  years  and  have  been  thoroughly 
validated.  These  programs  solve  certain  types  of  soil-structure  interaction 
problems  that  are  complex  and  for  which  no  simple  closed-form  solution  exist. 
Often  the  algorithm  employed  in  the  solution  of  a  soil-structure  interaction 
problem  will  yield  incorrect  results  if  the  number  of  significant  figures 
carried  in  the  calculation  is  insufficient. 

63.  Three  problems  were  chosen  to  show  that  different  results  can  be 
obtained  when  identical  problems  are  run  on  each  of  the  systems  tested  using 
the  same  program.  The  problems  have  already  been  used  in  a  previous  study 
(O'Neil  and  Peterson  1976)  with  the  same  programs  on  a  CDC  system.  The  re¬ 
ported  solutions  from  the  CDC  were  duplicated  in  this  study.  The  conclusions 
from  the  previous  study  indicated  that  the  CDC's  solutions  were  acceptable; 
therefore,  these  solutions  are  used  for  comparison  with  the  results  from  the 
systems  tested. 

Case  1 — flexible  sheet  pile  bulkhead  in  sand 

64.  The  deflections  and  moments  produced  in  a  flexible  sheet  pile  bulk 
head  embedded  in  sand  are  to  be  computed  for  the  physical  system  depicted  in 
Figure  9.  The  problem  was  solved  by  using  a  computer  code  (BMCOL  28)  which 
establishes  the  finite  difference  equations  for  a  beam  on  an  elastic  subgrade 
at  predesignated  equally  spaced  nodes  along  the  bulkhead  (Matlock  and  Ingram 
1963).  The  system  of  linear  difference  equations  so  generated  forms  a  matrix 
equation  in  which  the  stiffness  matrix  is  tightly  banded.  Recursive  tech¬ 
niques  are  used  to  solve  for  the  deflection  at  each  node,  and  moments,  shears 
and  soil  reactions  are  subsequently  computed. 

65.  It  is  common  practice  to  analyze  problems  like  that  shown  in  Fig¬ 
ure  9  by  utilizing  50  to  100  nodes,  but  such  solutions  may  be  relatively 


X 


100 

200 

300 

-  400 
x* 

g  500 

600 

700 

800 

900 

Figure  9.  Physical  system  for  Case  1 

crude,  especially  if  rapid  moment  gradients  occur.  Improved  solutions  can  be 
obtained  by  increasing  the  number  of  nodes;  i.e.,  by  decreasing  the  increment 
length  of  finite  spacing  between  nodes.  However,  as  the  increment  length  de¬ 
creases  below  some  value,  the  difference  in  deflection  between  two  adjacent 
nodes  decreases,  which  results  in  poorly  defined  derivatives  and  hence  in  an 
invalid  solution.  The  increment  length  at  which  unrealistic  answers  are  out¬ 
put  is  a  function  of  the  physical  parameters  input  and  of  the  number  of 
significant  digits  used  in  the  machine  calculations. 

66.  Soil  and  bulkhead  parameters  input  for  this  problem  are  shown  in 
Figure  10.  Specific  numerical  values  given  are  for  a  solution  using  900  in¬ 
crements.  Identical  runs  were  made  with  9,  100,  225,  300,  400,  450,  and  900 
increments.  The  results  are  summarized  in  Figures  11-16. 

67.  The  solutions  from  the  IBM,  PRIME,  and  VAX  in  single-precision  are 
inconsistent  and  deteriorate  as  the  number  of  increments  increases.  The 
Harris  single-precision  runs  were  identical  with  the  Harris  double-precision 
runs.  All  double-precision  runs  on  all  of  the  systems  were  virtually  identical 
with  the  solution  given  by  the  CDC,  with  the  exception  that  the  Harris  was  off 


-  TIE  BACK-4 - 

^  WATER  LINE 

•* 

•• •  • 

*•••  * 

^-MZ-38  SHEETING 

DENSE  SAND  vY-V 
•  •*•••< 

••••  •  •' 

v  •  •  *  •  , 

•  •  •«  1 
••••: 

••••• 

••  •  • 
\v< 

DREDGE  LINE 

v.-: 

••  •, 

- 

••  •• 

•;  ;.v  •  :*.*’• . .  ' 

*  •  •••• 

■V  ••• 

DFNSE  SAND 

i-Y-v.V 

/.v. 

_ 

••V 

••• 

.y 

•s.' 


NOTE:  Q  =  IMPOSED  LOAD/INCREMENT 

S  =  SPRING  CONSTANT  FOR  ONE  INCREMENT 
ALL  VALUES  FOR  INCREMENTAL  LENGTH  OF  1  IN. 
1-FT-WIDE  SECTION 


Figure  10.  Soil  and  bulkhead  parameters  for  Case  1 
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Figure  12.  Maximum  moment  versus  number  of  increments,  Case  1,  single-precision 


13.  Maximum  deflection  versus  number  of  increments,  Case  1,  double-precision 


Maximum  moment  versus  number  of  increments,  Case  1,  quadruple-precision 


slightly  only  on  the  900-increment  run.  Thus,  it  can  be  concluded  that  the 
only  acceptable  system  for  single-precision  execution  of  Case  1  is  the  Harris, 
while  all  systems  are  acceptable  in  double-precision. 

Case  2 — steel  pile  in  clay 

68.  A  steel  pipe  pile  is  driven  into  soft  clay,  as  shown  in  Figure  17. 

A  load  of  31.4  kips*  is  applied  to  the  butt  of  the  pile,  resulting  in  an 
applied  load  of  5000  lb/ radian  as  viewed  from  the  top.  It  is  desired  to  de¬ 
termine  the  distribution  of  total  vertical  stress  in  the  soil  surrounding  the 
pile. 

69.  The  problem  was  solved  using  a  simple  axisymmetric  finite  element 
computer  code  called  AXSYM  (Wilson  1965),  assuming  linear  stress-strain  be¬ 
havior  in  both  the  pile  material  and  the  soil.  The  elastic  parameters  are 
shown  in  Figure  17,  and  the  finite  element  model  is  depicted  in  Figure  18. 

The  particular  code  used  in  this  study  employs  quadrilateral  elements  composed 
of  four  constant  strain  triangles.  As  with  most  axisymmetric  finite  element 
codes,  two  stiffness  equations  are  developed  for  each  node  in  the  system  (one 
for  vertical  and  one  for  horizontal  displacement),  and  the  equations  are 
assembled  into  a  global  matrix  equation.  The  pairs  of  equations  containing 
stiffness  contributions  from  the  pile  (e.g.,  for  node  13)  contain  terms  of 
large  magnitude,  whereas  those  containing  stiffness  contributions  only  from 
the  soil  (e.g.,  node  14)  contain  terms  of  much  smaller  magnitude.  The  ratio 
of  these  magnitudes  is  very  roughly  the  ratio  of  the  elastic  moduli  of  the 
materials.  Equation  pairs  appear  in  the  global  equation  in  order  of  nodal 
numbering;  hence,  since  nodes  are  numbered  horizontally  to  minimize  matrix 
bandwidth,  along  each  horizontal  row  of  nodes  there  exists  a  pair  of  rows  in 
the  stiffness  matrix  with  large  terms  followed  by  a  pair  with  very  small  terms 

70.  The  global  stiffness  matrix  equation  is  solved  by  a  decomposition 
procedure  that  is  mathematically  equivalent  to  Gaussian  elimination.  In 
essence,  the  terms  in  the  rows  of  the  matrix  having  large-magnitude  entries 
are  multiplied  by  ratios  obtained  by  dividing  small  terms  in  the  row  being 
reduced  by  the  large-magnitude  entries.  These  results  are  then  added  to  the 
small  terms  in  the  row  being  reduced.  The  effect  is  that,  except  for  the 
terms  being  eliminated,  the  small  terms  in  the  row  being  reduced  are  changed 


*  A  table  of  factors  for  converting  non-SI  units  of  measurement  to  SI 
(metric)  units  is  presented  on  page  3. 
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Figure  18.  Finite  element  model  for  Case  2 
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only  in  their  higher  significant  figures,  and  some  of  the  significance  of  the 
physical  system  may  be  lost  if  the  number  of  significant  figures  being  used  in 
the  calculations  is  insufficient.  Since  many  such  "hard-soft"  interfaces 
occur,  the  errors  so  produced  become  mangified. 

71.  Errors  of  the  type  just  described  can  be  reduced  (but  not 
eliminated)  by  restructuring  the  matrix  stiffness  equation  (renumbering  the 
nodes) ,  but  restructuring  can  result  in  matrices  with  larger  bandwidths  re¬ 
quiring  considerably  more  storage. 

72.  The  solutions  to  the  problem  are  summarized  in  Figures  19-21.  Fig¬ 
ure  19  shows  that  the  Harris  single-precision  solution  is  the  only  one  that 
matches  the  CDC  solution.  The  PRIME  double-precision  solution  (Figure  20) 
varies  from  the  CDC  solution  when  the  depth  is  less  than  50  in.  All  other 
double-  and  quadruple-precision  runs  match  those  from  the  CDC. 


Case  3 — pile  buckling  analysis 

73.  A  problem  of  interest  to  the  geotechnical  engineer  is  the  computa¬ 


tion  of  the  buckling  load  on  a  pile  driven  through  soft  soil  to  bedrock.  The 
physical  problem  depicted  in  Figure  22  was  solved  using  a  version  of  the  BMCOL 
finite  difference  computer  code  described  in  Case  1  that  includes  axial  load 


in  the  formulation  of  the  equations  at  the  nodes.  However,  in  this  case,  it 
was  necessary  to  model  load-deflection  characteristics  of  the  soil  in  a  non¬ 
linear  fashion.  This  so-called  "p-y"  input,  shown  for  a  typical  increment  of 
the  pile  in  Figure  22,  is  based  on  published  criteria  for  one-time  loading  of 
the  soil.  Other  necessary  data  are  described  in  Figure  22. 

74.  The  applied  load  was  assumed  to  act  with  an  eccentricity  of  1  in. , 
resulting  in  a  concentric  axial  load  and  a  moment  at  the  top  of  the  pile.  The 
load  was  increased  from  the  Euler  buckling  load  of  approximately  300  kips 
(assuming  the  pile  is  pinned  at  both  ends)  to  approximately  900  kips.  The 
moment  was  also  increased  in  proportion  to  the  loads.  No  axial  load  transfer 
was  assumed  to  occur  between  the  top  and  the  tip.  Primary  output  needed  to 
evaluate  the  buckling  load  is  lateral  deflection  at  the  top  of  the  pile  versus 
applied  load.  Figures  23-35  show  the  solutions  obtained  on  each  of  the  sys¬ 
tems.  The  Harris  results  conformed  closely  to  the  CDC  results  except  that 
substantial  differences  occurred  at  or  near  the  buckling  point  (Figures  23-25). 
The  other  systems  were  badly  in  error  in  single-precision  but  were  identical 
with  the  CDC  solution  in  double-precision.  The  Harris  quadruple-precision 


result  was  also  identical  with  the  CDC  solution. 
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80  FT 


Maximum  lateral  deflection  versus  applied  load,  Case  3,  single-precision,  all  systems 


Maximum  lateral  deflection  versus  applied  load,  Case  3,  single-precision,  CDC  system 


Maximum  lateral  deflection  versus  applied  load.  Case  3,  single-precision,  IBM  system 


Figure  27.  Maximum  lateral  deflection  versus  applied  load,  Case  3,  single-precision,  PRIME  system 


Figure  28.  Maximum  lateral  deflection  versus  applied  load,  Case  3,  single-precision,  VAX  system 


Maximum  lateral  deflection  versus  applied  load,  Case  3,  double-precision,  all  systems 


Maximum  lateral  deflection  versus  applied  load,  Case  3,  double-precision,  IBM  system 


Maximum  lateral  deflection  versus  applied  load,  Case  3,  double-precision,  PRIME  system 


Figure  33.  Maximum  lateral  deflection  versus  applied  load,  Case  3,  double-precision,  VAX  system 


Maximum  lateral  deflection  versus  applied  load,  Case 
quadruple-precision,  CDC  and  Harris  systems 


Maximum  lateral  deflection  versus  applied  load,  Case  3,  quadruple-precision,  Harris  system 


PART  VI:  EXECUTION  TIME  COMPARISONS 


75.  While  it  was  not  a  primary  objective  of  this  study  to  compare  test 
program  execution  times  on  the  systems  tested,  accuracy  considerations  do 
interrelate  with  execution  efficiency  since  there  is  often  a  trade-off  between 
accuracy  achieved  and  execution  time  expended.  In  fact,  increased  accuracy  is 
always  possible  through  software  implementation  of  extended-precision  arith¬ 
metic  which  exacts  a  heavy  toll  in  execution  time.  And,  as  stated  earlier, 
execution  efficiency  of  FP  operations  cannot  be  ignored  when  evaluating  mini¬ 
computer  systems  for  S&E  applications.  Thus,  the  compile  and  execution  times 
which  were  collected  as  a  by-product  of  running  the  test  programs  are  pre¬ 
sented  in  Tables  6-9.  Some  points  to  note  are: 

a.  As  would  be  expected,  compile  time  differences  between  single¬ 
precision  and  double-precision  compilation  on  a  given  machine 
were  in  each  case  insignificant. 

b.  For  most  programs,  compile  times  on  the  three  primary  systems 
tested  show  that  the  VAX  was  fastest,  followed  by  the  PRIME, 
and  then  the  Harris,  with  time  ratios  (slowest  to  fastest)  of 
no  more  than  approximately  1.5.  There  were  exceptions  to  this 
ranking;  e.g.,  the  PRIME  compile  time  for  program  MATRIX  did 
not  fit  the  pattern. 

c.  Execution  time  differences  between  single-precision  and  double- 
precisi.on  object  codes  on  a  given  machine  were  in  most  cases 
insignificant.  There  was  no  clear  pattern  to  indicate  that 
double-precision  execution  required  substantially  more  time 
than  single-precision  on  any  of  the  three  systems. 

d.  For  the  larger,  longer  running  "real  world"  programs,  the  exe¬ 
cution  time  rankings  were  the  same  as  those  for  compile  time; 
i.e.,  VAX,  PRIME,  and  Harris,  in  that  order,  with  time  ratios 
(slowest  to  fastest)  ranging  from  approximately  1.5  to  2.6. 
However,  for  the  shorter  "textbook"  programs,  the  speed  rank¬ 
ings  were  Harris,  VAX,  then  PRIME,  with  the  Harris  far  outper¬ 
forming  the  other  two  in  these  cases.  Since  the  Harris  had  the 
slowest  times  for  the  "real  world"  programs  and  the  fastest 
times  for  the  shorter,  "textbook"  programs,  it  might  be  sur¬ 
mized  that  the  textbook  program  execution  times  are  probably 
dominated  by  program  initialization  time,  with  the  Harris  out¬ 
performing  the  other  two  systems  in  this  task.  Regardless  of 
the  actual  explanation  of  the  contradictory  timings,  the  real- 
world  execution  times  are  probably  much  more  valuable  as  per¬ 
formance  comparison  metrics . 

e.  The  PRIME,  VAX,  and  IBM  execution  times  in  double-precision 
were  generally  shorter  than  those  observed  for  the  Harris  in 
single-  or  double-precision. 


Table  9 

Execution  Time  Comparisons  by  Case 


System 


C 

Harris 

IBM 

PRIME 

VAX 


Execution  Time,  sec,  for  Cited  Precision 
Single  Double 


1.417 

19.89 

82.21 

17.036 

12.29 


Case  1 


18.97 

130.34 

19.075 

13.48 


Case  2 


Harris 

58.04 

58.10 

254.58 

IBM 

83.15 

139.51 

PRIME 

34.112 

35.190 

VAX 

22.68 

Case  3 

31.66 

CDC 

3.04 

Harris 

37.73 

38.54 

181.80 

IBM 

53.34 

90.31 

PRIME 

36.001 

37.848 

f.  Harris  quadruple-precision  execution  times,  except  for  one 
textbook  program  (MATRIX) ,  were  substantially  longer  than 
Harris  double-precision  times.  For  real  world  programs, 
quadruple/double  time  ratios  ranged  as  high  as  approximately 
5.0.  For  programs  with  a  high  percentage  of  total  execution 
time  expenditure  in  FP  computations,  it  is  certainly  conceiv¬ 
able  that  the  ratio  may  go  substantially  higher.  Such  time 
penalties  are  generally  unacceptable  for  production  execution 
of  long-running  S&E  programs. 

76.  The  timing  comparisons  are  presented  here  only  as  interesting 
observations  that  came  from  the  accuracy  tests.  The  reader  should  avoid  draw¬ 
ing  erroneous  conclusions.  Specifically: 

a.  No  "benchmark"  or  performance  study  was  intended  or  designed. 
Programs  were  selected  only  to  measure  computational  accuracy 
and  not  to  measure  performance.  The  program  mix  and  indeed 
the  entire  approach  to  the  problem  would  have  been  substan¬ 
tially  different  had  this  been  a  comparative  performance  study. 

b.  Past  experience  has  shown  that  program  timings  obtained  on  some 
minicomputer  systems  are  greatly  affected  by  system  load  condi¬ 
tions;  on  some  systems  much  more  than  others.  For  example,  an 
identical  program  execution  may  register  much  less  execution 
time  under  a  "no-load"  or  uniprogramming  condition  than  during 
a  period  of  heavy  system  load.  Timings  presented  here  were  not 
obtained  under  no-load  conditions  to  eliminate  system  loading 
influences.  Neither  were  attempts  made  to  control  load  condi¬ 
tions  during  program  runs  so  that  similar  conditions  would  ex¬ 
ist  on  all  systems.  In  fact,  load  conditions  in  most  cases  were 
unknown  during  program  runs.  Thus,  the  "repeatability"  of  the 
timings  is  highly  suspect. 

c .  The  specific  systems  used  in  this  study,  the  VAX  11/750,  the 
PRIME  550,  and  the  Harris  500,  were  not  selected  to  be  directly 
competitive  in  performance  or  price.  No  attempt  was  made  to 
determine  the  exact  configurations  on  which  the  programs  were 
executed,  the  price  of  the  systems  used,  or  the  vendor's  per¬ 
formance  rating  of  the  particular  system  and  configuration 
relative  to  others  in  his  product  line  or  in  competitor's 
product  lines.  In  short,  the  systems  were  not  selected  as 
approximately  equally  priced  alternatives  for  competitive  pro¬ 
curement,  but  simply  as  three  available  systems  with  FP  archi¬ 
tectures  representative  of  the  three  families  of  systems. 
Therefore,  no  conclusions  should  be  drawn  comparing  the  general 
performance  or  price/performance  of  the  VAX  11  family  versus 
the  PRIME  550  family  versus  the  Harris  Series  500  family. 
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PART  VII:  CONCLUSIONS  AND  RECOMMENDATIONS 


Relative  Accuracy  of  the  Three  FP  Architectures 

77.  Both  examination  of  the  FP  representations  and  execution  of  the 
test  programs  indicate  that,  for  single-precision  computations,  the  Harris  is 
substantially  more  accurate  than  either  of  the  other  two  systems  and  the  VAX 
is  slightly  more  accurate  than  the  PRIME.  Harris  superiority  is  achieved  by 
use  of  the  double-precision  representation  for  all  FP  values  in  machines 
equipped  with  a  SAU.  This  feature  provides  the  equivalent  of  approximately 
11  decimal  digits.  While  both  the  VAX  and  the  PRIME  store  a  23-bit  fraction 
providing  the  equivalent  of  approximately  7  decimal  digits,  the  VAX  is  more 
accurate  due  to: 

a.  The  added  "hidden"  bit  giving  it  a  24-bit  rather  than  23-bit 
fraction. 

b.  The  achievement  of  correct  rounding. 

78.  However,  the  test  cases  indicate  that  users  should  be  wary  of  using 
single-precision  computations  for  S&E  problems  on  any  of  the  three  systems . 

The  VAX  and  PRIME  single-precision  runs  failed  rather  badly  in  all  three  of 
the  "real  world"  test  cases.  While  the  Harris's  11 -digit  representation 
failed  to  provide  sufficient  accuracy  in  only  one  of  these  three  cases,  it  too 
must  be  considered  suspect  since  it  fails  to  provide  the  level  of  confidence 
needed  to  allay  doubts  as  to  its  accuracy  when  used  for  such  problems.  It 
must  be  remembered  that  results  obtained  at  one  or  more  points  during  computa¬ 
tion  may  in  fact  be  accurate  and  that,  while  final  results  may  not  appear  un¬ 
reasonable,  they  may  in  fact  be  wrong. 

79.  Double-precision  representations  provide  approximately  16,  14,  and 
11  decimal  digits  for  the  VAX,  PRIME,  and  Harris,  respectively.  All  have 
hardware  implementations  of  double-precision  operations.  The  16-digit  VAX 
data  format  produced  no  failures  in  any  of  the  tests  and  could  be  used  with 
confidence  for  common  S&E  applications  in  the  Corps  requiring  highly  precise 
data  values.  The  PRIME'S  14-digit  double-precision  results  conformed  well  to 
the  CDC  "standard"  in  all  but  one  test  case  (real  world  Case  2),  where  there 
was  a  small  but  significant  deviation.  The  PRIME'S  double-precision  would 
probably  suffice  for  the  great  majority  of  applications  but  would  provide  a 
somewhat  lower  level  of  confidence  than  the  VAX  or  CDC.  As  previously  stated 
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(Part  III),  Harris  double-precision  and  single-precision  representations  are 
the  same  in  SAU-equipped  machines;  thus,  the  single-precision  comments  above 
apply  here  as  well. 

80.  Previous  experience  with  software  implementations  of  FP  arithmetic 
leads  to  the  conclusion  that  use  of  the  Harris  quadruple-precision  capability 
would  exact  too  great  a  performance  penalty  for  many  applications.  Timings 
from  the  test  cases  executed  for  this  study  confirm  this  belief. 

81.  In  summary,  this  study  indicates  that: 

a.  Single-precision  arithmetic  should  be  avoided  as  a  standard 
practice  when  using  either  the  VAX  or  the  PRIME  system  for 
S&E  applications  in  the  Corps.  This  is  not  to  say  that  such 
use  should  be  totally  prohibited,  but  that  it  should  be  re¬ 
stricted  to  cases  where  a  careful  analysis  and  thorough  under¬ 
standing  of  the  problem  to  be  solved  and  the  programmed  solu¬ 
tion  indicate  that  the  choice  is  prudent.  Since  it  is  not 
necessarily  known  a  priori  when  double-precision  is  needed 
(this  may  be  more  data-dependent  than  program-dependent),  the 
cost  of  making  a  definite  determination  will,  in  most  cases,  be 
unjustified.  Thus,  it  is  recommended  that  double-precision  be 
used  as  standard  practice  when  using  either  the  VAX  or  the 
PRIME  system.  The  double-precision  arithmetic  of  both  systems 
proved  to  be  highly  accurate  in  the  tests  reported  herein,  with 
the  VAX  having  a  small  but  significant  advantage. 

b.  Since  the  Harris  uses  the  same  representation  for  both  single- 
and  double-precision  values,  the  options  are  limited  to 
standard-precision  or  software-implemented  quadruple-precision. 
Quadruple-precision  execution  times  will  probably  be  prohibi¬ 
tive  for  most  precision-sensitive  Corps  applications;  thus, 
the  question  reduces  to  whether  Harris  standard-precision  is 
sufficient.  Based  on  this  study,  the  answer  seems  to  be  that 
for  many  (perhaps  most)  applications,  the  degree  of  accuracy 
provided  is  sufficient,  but  for  some  existing  S&E  applications 
in  the  Corps,  accuracy  may  be  a  problem  for  the  Harris  system. 
While  the  Harris  standard  FP  arithmetic  certainly  provides  far 
greater  computational  accuracy  than  either  the  VAX  or  the  PRIME 
single-precision  arithmetic,  it  may  not  provide  a  level  which 
will  inspire  confidence  and  allay  fears  in  the  Corps'  engineer 
programmer  community.  Thus,  it  is  recommended  that,  for  S&E 
applications,  all  the  tested  systems  be  used  with  caution  and 
that  users  be  ever  alert  to  indications  that  an  application  is 
or  will  be  precision-sensitive.  Such  indications  will  require 
either  an  accuracy  study  for  that  application  or  a  move  to 
another  system.  Note  also  that  it  is  recommended  that  the  "P" 
compiler  option  be  used  as  a  standard  practice  (see  Part  III) 
so  that  input/output  data  fields  are  not  artificially  re¬ 
stricted  to  the  documented  single-precision  length. 

82.  Note  that  blanket  use  of  double-precision  values  imposes  a  penalty 
in  memory  required  and  in  some  systems  a  penalty  in  execution  time.  While 


this  study  did  not  specifically  address  FP  computational  speed,  no  significant 
difference  in  execution  time  for  single-  versus  double-precision  computation 
was  detected  for  any  of  the  three  systems.  A  further  study  would  be  required 
to  make  definitive  statements  regarding  single-  versus  double-precision  speed 
comparisons  for  the  systems  in  question.  Note  also  that  any  performance 
penalties  that  may  accompany  blanket  double-precision  are  imposed  as  a  default 
in  the  Harris  system;  i.e.,  every  FP  value  requires  2  words  (6  bytes).  Thus, 
when  comparison  is  made  of  resource  requirements,  it  must  be  realized  that  the 
VAX  or  PRIME  double-precision  representation  (8  bytes)  does  not  require  double 
the  data  memory  required  by  Harris  standard  representation,  but  that  the  ratio 
is  actually  8  bytes  to  6  bytes. 

Recommendations  for  Error  Avoidance 


83.  Given  an  understanding  of  the  basic  algorithms  for  FP  operations 
and  the  mechanics  of  how  errors  occur,  and  care  in  coding  accuracy-critical 
calculations,  errors  which  might  otherwise  be  inadvertently  created  can  be 
avoided.  Specifically,  the  user  should: 

a.  Be  very  careful  when  subtracting  very  nearly  equal  values.  The 
result  may  contain  very  few  significant  digits,  and  loss  of 
digits  in  a  subsequent  operation  could  be  disastrous.  Also, 

be  aware  that  the  distributive  law  may  fail.  (See  example  in 
Part  IV.) 

b.  Be  careful  of  addition  or  subtraction  operations  performed  on 
numbers  of  vastly  differing  magnitudes.  Some  loss  of  signif¬ 
icance  is  certain,  and  the  associative  law  of  addition  can 
fail.  (See  example  in  Part  IV.) 

c.  Never  test  for  absolute  equality  of  FP  values.  Instead  detect 
approximate  equality  by  testing  for  a  difference  in  value  less 
than  some  relative  e. 

d.  Be  aware  that  the  effectiveness  of  adherence  to  recommenda¬ 
tions  a  and  b  can  be  negated  by  an  optimizing  compiler  which 
reorganizes  code  assuming  associativity  and  distributivity . 
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APPENDIX  A:  RESULTS  OF  MACHAR,  AN  ENVIRONMENTAL  INQUIRY  SUBROUTINE 


1.  Subroutine  MACHAR  (Cody  and  Waite  1980)  dynamically  determines 
13  machine  constants  relating  to  the  floating-point  arithmetic  system.  These 
constants  can  be  used  to  check  manufacturer's  claims  or  documentation  for  a 
system  and  are  specified  below: 

IBETA  -  Base  of  the  floating-point  representation 

IT  -  Number  of  base  IBETA  digits  in  the  floating-point 
significand 

IRND  -  0  if  floating-point  addition  truncates 
1  if  floating-point  addition  rounds 

NGRD  -  Number  of  guard  digits  for  multiplication: 

0  if  IRND  =  1,  or  if  IRND  =  0  and  only  IT  base  IBETA 
digits  participate  in  the  post-normalization  shift  of 
the  floating-point  significand  in  multiplication 

1  if  IRND  =  0  and  more  than  IT  base  IBETA  digits  partici¬ 
pate  in  the  post-normalization  shift  of  the  floating-point 
significand  in  multiplication 

MACHEP  -  Largest  negative  integer  such  that  1 .0+FLOAT(IBETA) 

"MACHEP.  NE.  1.0,  except  that  MACHEP  is  bounded  below  by 
-(IT+3) 

NEGEPS'  -  Largest  negative  integer  such  that  1 .0-FLOAT(IBETA) 

‘  * 'NEGEPS. NE. 1.0,  except  that  NEGEPS  is  bounded  below  by 

-(IT+3) 

IEXP  -  Number  of  bits  (decimal  places  if  IBETA  =  10),  reserved 

for  the  representation  of  the  exponent  (including  the  bias 
or  sign)  of  a  floating-point  number 

MINEXP  -  Largest  magnitude  negative  integer  such  that  FLOAT(IBETA) 

’ 'MINEXP  is  a  positive  floating-point  number 

MAXEXP  -  Largest  positive  integer  exponent  for  a  finite  floating¬ 
point  number 

EPS  -  Smallest  positive  floating-point  number  such  that 

1.0+EPS.NE.1.0.  In  particular,  if  either  IBETA  =  2  or 
IRND  =  0,  EPS  =  FLOAT(IBETA) "MACHEP.  Otherwise, 

EPS  =  (FLOAT (IBETA)"  MACHEP) /2 

EPSNEG  -  A  small  positive  floating-point  number  such  that 

1.0-EPSNEG.NE.1.0.  In  particular,  if  IBETA  =  2  or 
IRND  =  0,  EPSNEG  =  FLOAT(IBETA)" NEGEPS.  Otherwise, 

EPSNEG  =  (IBETA" NEGEPS )/2.  Because  NEGEPS  is  bounded 
below  by  -(IT+3),  EPSNEG  may  not  be  the  smallest  number 
which  can  alter  1.0  by  subtraction 

XMIN  -  Smallest  nonvanishing  floating-point  power  of  the  radix. 

In  particular,  XMIN  =  FLOAT(IBETA)" MINEXP 
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XMAX  *  Largest  finite  floating-point  number.  In  particular, 

XMAX  =  ( 1 . 0-EPSNEG)*FL0AT ( IBETA) *  * MAXEXP .  Note:  On  some 
machines,  XMAX  will  be  only  the  second,  or  perhaps  third, 
largest  number,  being  too  small  by  1  or  2  units  in  the 
last  digit  of  the  significand 

2.  These  constants  are  discussed  further  in  Cody  and  Waite  (1980).  The 
results  of  MACHAR  are  given  in  Table  Al.  Subroutine  MACHAR  is  listed  on 
pages  A4-A6. 
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This  value  appears  to  reflect  the  effect  of  HACHAR  computations  being  perforated  entirely  within  the  PRIME  system's  arithmetic  unit;  i.e., 
without  incurring  the  truncation  of  bits  that  accompanies  the  storing  of  an  FP  value  to  memory.  Therefore,  direct  comparison  of  this  PRIME 
constant  to  the  corresponding  constant  from  the  other  systems  may  not  be  relevant  in  a  general  sense. 

MACHAR  produces  obviously  erroneous  values  for  this  constant. 

HACHAR  produces  an  FP  overflow  while  attempting  to  compute  this  constant. 
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INTEGER  I, IBETA, IEXP, IRND, IT, IZ, J , K, MACHEP, 

I  MX,  NEGEP,  NC-RD 

DOUBLE  PRECISION  A, B, BETA , BETAIN.BETAM1 . EPS, EPSNEG, ONE, XMAX, 

C  XMIN, Y,Z, ZERO 

THIS  ROUTIHE  IS  INTENDED  TO  DETERMINE  THE  CHARACTERISTICS  OF 
THE  FLOATING-POINT  ARITHMETIC  SYSTEM  THAT  ARE  SPECIFIED  BELOW. 

IBETA  -  THE  RADIX  OF  THE  FLOATING-POINT  REPRESENTION 
IT  -  THE  NUMBER  OF  BASE  IBETA  DIGITS  IN  THE  FLOATING-POINT 

SIGNIFICAND 

IRND  -  0  IF  FLOATING-POINT  ADDITION  CHOPS. 

1  IF  FLOATING-POINT  ADDITION  ROUNDS 
NGRD  -  THE  NUMBER  OF  GUARD  DIGITS  FOR  MULTIPLICATION.  IT  IS 
0  IF  IRND-1 >  OR  IF  IRND=0  AND  ONLY  IT  BASE.  IBETA 
DIGITS  PARTICIPATE  IN  THE  POST  NORMALIZATION  SHIFT 
OF  THE  FLOATING-POINT  SIGNIFICAND  IN  MULTIPLICATION 
1  IF  IRND=0  AND  MORE  THAN  IT  BASE  IBETA  DIGITS 

PARTICIPATE  IN  THE  POST  NORMILIZATION  SHIFT  OF  THE 
FLOATING-POINT  SIGNIFICAND  IN  MULTIPLICATION 
MACHEP  -  THE  LARGEST  NEGATIVE  INTEGER  SUCH  THAT 

1.0  +  FLOAT( IBETA JxxMACHEP  .NE.  1.0,  EXCEPT  THAT 
MACHEP  IS  BOUNDED  BELOW  BY  -(IT+3) 

NEGEPS  -  THE  LARGEST  NEGATIVE  INTEGER  SUCH  THAT 

1 . O-FLOAT ( IBETAJxhNEGEPS  .NE.  1.0,  EXCEPT  THAT 
NEGEPS  IS  BOUNDED  BELOW  BY  -< IT+3) 

IEXP  -  THE  NUMBER  OF  BITS  (DECIMAL  PLACES  IF  IBETA  =  10) 
RESERVED  FOR  THE  REPRESENTATION  OF  THE  EXPONENT 
(INCLUDING  THE  BIAS  OR  SIGN)  OF  A  FLOATING-POINT 
NUMBER 

MINEXP  -  THE  LARGEST  IN  MAGNITUDE  NEGATIVE  INTEGER  SUCH  THAT 
FLOAT ( IBETA)XXMINEXP  IS  A  POSITIVE  FLOATING-POINT 
NUMBER 

MAXEXP  -  THE  LARGEST  POSITIVE  INTEGER  EXPONENT  FOR  A  FINITE 
FLOATING-POINT  NUMBER 

ESP  -  THE  SMALLEST  POSITIVE  FLOATING-POINT  NUMBER  SUCH 
THAT  1.0+EPS  .NE.  1.0.  IN  PARTICULAR,  IF  EITHER 
IBETA  =  2  OR  IRND  =  0,  EPS  =  FLOAT! IBETA)H*MACHEP. 

OTHERWISE.  EPS  =  (FL0A(IBETA)**MACHEP)/2 
EPSNEG  -  A  SMALL  POSITIVE  FLOATING-POINT  NUMBER  SUCH  THAT 
1 . 0-EPSNEG  .NE.  1.0  IN  PARTICULAR.  IF  IBETA  =  2 
OR  IRND=0,  EPSNEG  =  FLOAT!  I  BET  A)  <<<<NEGEPS  . 

OTHERWISE,  EPSNEG  =  (IBETA*HNEGEPS)/2.  BECAUSE 
HEGEPS  IS  BOUNDED  BELOW  BY  -!IT*3).  EPSNEG  MAY  NOT 
BE  THE  SMALLEST  NUMBER  WHICH  CAN  ALTER  1 . 0  BY 
SUBTRACTION. 

XHIN  -  THE  SMALLEST  NON-VANISHING  FLOATING-POINT  POWER  OF  THE 
RADIX.  IN  PARTICULAR,  XHIN  =  FLOAT ( IBETA)»»MINEXP 
XMAX  -  THE  LARGEST  FINITE  FLOATING-POINT  NUMBER.  IN 

PARTICULAR  XMAX  =  ( 1 . 0 - EPSNEG) XFLOAT ( IBETA ) KKMAXEXP 
NOTE  -  ON  SOME  MACHINES  XMAX  WILL  BE  ONLY  THE 
SECOND,  OR  PERHAPS  THIRD,  LARGEST  NUMBER.  BEING 
TOO  SMALL  BY  1  OR  2  UNITS  IN  THE  LAST  DIGIT  OF 
THE  SIGNIFICAND. 

LATEST  REVISION  -  OCTOBER  22,  1979 

AUTHOR  -  W.  J.  CODY 
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CALL  MACHAR( IBETA, IT, IRND, NGRD, MACHEP, NEGEP, IEXP, MINEXP, 
IMAXEXP , EPS , EPSNEG , XMIN ,XMAX) 

WRITE(6, 10) IBETA, IT, IRND, NGRD, MACHEP, NEGEP, IEXP, MINEXP 
t  .MAXEXP, EPS, EPSNEG. XMIN, XMAX 


10  FORMATdX,  '  IBETA 

= 

',110,  / 

0 

IX, 'IT 

= 

' , 110  ,  / 

0 

IX, 'IRND 

z 

M10,  / 

t 

IX, 'NGRD 

z 

'  ,110,  / 

t 

IX, 'MACHEP 

z 

’.110,  / 

1 

IX, 'NEGEPS 

z 

',110,  / 

1 

IX, 'IEXP 

z 

',110,  / 

( 

IX, 'MINEXP 

z 

',110,  / 

I 

IX, ’MAXEXP 

z 

',110,  / 

« 

IX, ’EPX 

z 

' , D15 . 9, 

< 

IX, ’EPSNEG 

z 

',015.9, 

( 

IX, 'XMIN 

z 

' ,015.9, 

( 

STOP 

END 

IX, 'XMAX 

' »D15 . 9) 

0004 

0007 


•Ml 


SUBROUTINE  HACHARCIBETA,  IT.  IRND.  NGRD.  NACHEP,  NEGEP. 
I  I  EXP.  NINEXP,  HAXEXP,  EPS,  EPSNEG.XMIN, 

A  XMAX) 

INTEGER  l.IBETA,  IEXP.IRND.  IT.  IZ.  J.  K,  NACHEP, 
t  NAXEXP,  NINEXP,  NX,  NEGEP,  NGRD 

DOUBLE  PRECISION  A, B, BET A, BET A IN, BET AN1, 
t  ESP, EPSNEG, ONE, XHAX.XNIN, Y.Z. ZERO 

THIS  SUBROUTINE  IS  INTENDED  TO  DETERNINE  THE 
CHARACTERISTICS  OF  THE  FLOATING-POINT  ARITHNETIC 
SYSTEN  THAT  ARE  SPECIFIED  BELOW. 


ONE:DBLE( FLOAT ( 1 ) ) 
ZERO  =  0.000 


DETERNINE  IBETA,BETA  ALA  HALCOLH 


A  =  ONE 
10  A  =  A  ♦  A 

IF  ( ((A+ONE)-A)-ONE  .EO.  ZERO)  GO  TO  10 
B  =  ONE 

20  B  =  B  +  B 

IF( (A+B)-A  .EQ.  ZERO)  GO  TO  20 
IBETA  s  INT(SNGL((A  ♦  B)  -  A)> 

BETA  =  DBLE( FLOAT (IBETA) ) 

DETERNINE  IT,  IRND 
IT  s  0 
B  «  ONE 

100  IT  *  IT  ♦  1 
B  =  B  *  BETA 

IF( ( (B  +  ONE)-B)-ONE  .EG.  ZERO)  GO  TO  100 
IRND  =  0 

BETAN1  =  BETA  -ONE 
IF((A*BETAH1)-A  .NE.  ZERO) IRND  =  I 


DETERNINE  NEGEP.  ESPSNEG 


NEGEP  s  IT  ♦  S 
BETAIN  =  ONE/BETA 
A  =  ONE 

DO  200  I  =  1, NEGEP 
A  =  A  *  BETAIN 
200  CONTINUE 


B  =  A 

210  IF((ONE-A)-ONE  .NE.ZERO)GO  TO  220 
A  =  A  *  BETA 
NEGEP  =  NEGEP  -  1 
GO  TO  210 

220  NEGEP  *  -NEGEP 
EPSNEG  =  A 

IF{ (IBETA  .EG.  2)  .OR.  (IRND  .EG.  0))  GO  TO  300 
A  =  ( A*( ONE+A ) )  /(0NE40NE) 

IF  ( (ONE-A)-ONE  .NE.  ZERO) EPSNEG  =  A 

DETERNINE  NACHEP,  EPS 

300  NACHEP  =  -IT  -  3 
A  =  B 

310  IF((ONE+A)-ONE  .NE.  ZERO  )G0  TO  320 
A  =  A  *  BETA 
NACHEP  =  NACHEP  ♦  1 
GO  TO  310 
320  EPS  s  A 

IF  ((IBETA  .EG.  2)  .OR.  (IRND  .EG.  0))  GO  TO  350 
A  =  (An(ONE*A))/(ONE*ONE> 

IF((ONE+A)-ONE  .NE.  ZERO)  EPS  :  A 

DETERNINE  NGRD 

350  NGRD  =0 

IF( ( IRND  .EG.  0)  .AND.  ((0NE+EP5)K0NE-0HE)  .NE.  ZERO)  NGRD  =  1 
DETERNINE  IEXP,  NINEXP,  XHIN 

LOOP  TO  DETERNINE  LARGEST  I  AND  K  =  2«*1  SUCH  THAT 
(1/BETA)  NM  (2KK(l)) 

DOES  NOT  UNDERFLOW 

EXIT  FORN  LOOP  IS  SIGNALED  BY  AH  UNDERFLOW 

1*0 
K  *  1 

Z  *  BETAIN 
400  Y  =  Z 

Z  =  Y  *  Y 


ooo  ouuuo 


CHECK  FOR  UNDERFLOW  HERE 


LOOP  TO  DETERMINE  MIHEXP,  XMIH 

EXIT  FORM  LOOP  IS  SIGNALED  BY  AN  UNDERFLOW. 


450  XMIH  =  Y 

Y  =  Y  *  BETAIH 


CHECK  FOR  UNDERFLOW  HERE 


A  =  Y  w  ONE 

IF( (  (  A+A )  .EQ.  ZERO)  .OR.  (DABS(Y)  .GE.  XMIN))  GO  TO  4(0 
K  =  K  +  1 
GO  TO  450 
460  NINEXP  =  -K 


DETERMINE  MAXEXP.  XMAX 


IF( (MX. GT .  K+K-3)  .OR.  (IBETA  .E«.  10))  GO  TO  500 
MX  =  MX  ♦  MX 
I EXP  =  I EXP  ♦  1 
500  MAXEXP  =  MX  ♦  MIHEXP 


ADJUST  FOR  MACHINES  WITH  IMPLICIT  LEADING 
BIT  IN  BINARY  SIGNIFICAND  AND  MACHINES  WITH 
RADIX  POINT  AT  EXTREME  RIGHT  OF  SIGNIFICAND 


I  =  MAXEXP  ♦  MIHEXP 

IF((IBETA  .EG.  2)  .AND.  (I  .EG.  0))  MAXEXP  =  MAXEXP  -  1 
IF  (I  .GT.  20)  MAXEXP  =  MAXEXP  -  1 
IF( A.NE.  Y)  MAXEXP  =  MAXEXP  -2 
XMAX  =  ONE  -  EPSNEG 

IF<XMAX<<ONE  .HE.  XMAX)  XMAX  =  ONE  -  BETA  *  EPSNEG 
XMAX  =  XMAX  /  (BETA  «  BETA  *  BETA  *  XMIN) 

I  =  MAXEXP  ♦  HINEXP  ♦  3 
IFU  .IE.  0)  GO  TO  520 

DO  510  J  =  1,1 

IF( IBETA  . EO . 2 )XMAX  =  XMAX  ♦  XMAX 
IF  (IBETA  .NE.  2)  XMAX  =  XMAX  *  BETA 
510  CONTINUE 
520  RETURN 


APPENDIX  B:  LISTINGS  OF  PROGRAMS  USED  IN  MATHEMATICAL  TEST  PROBLEMS 

PROGRAM  MAXJ 


2 


C  IMPLICIT  DOUBLE  PRECISION  IA-H.O-Z) 
SL  =  0. 

SU  =  0. 

READ  <5,N)J 
K  =  J  ♦  1 
DO  lit  I  =  1.  J 
VL  =  PLOAT(I) 

VU  =  FLOAKK  -  I) 

SL  s  SL  +  VL 
SU  =  SU  +  VU 
100  CONTINUE 

URITE!6»10)SL,SU 
10  FORMAT! •  ' , E14.8.2X, E14.8) 

DIFF  *  SU  -  SL 
MRITEI6.20)  DIFF 

20  FORMAT!'  DIFFERENCE  =  F10.1) 

STOP 
END 


MAX00010 

MAX00020 

MAX00030 

MAX00040 

MAX00050 

MAX00060 

MAX00070 

MAX00080 

MAX00090 

MAX00100 

MAX00110 

MAX00I20 

MAX00130 

MAX00140 

MAX00150 

MAX00160 

MAX00170 

MAXOOISO 


Program  MATRIX 


REAL  H(6,6),P,9  MAT000I0 

INTEGER  I.  J,  K,  N  MAT00020 

URITEC6.800)  MATB683B 

8BB  FORMAT (1H1.40X.37HTHE  SIX  BY  SIS  MATRIX  TO  BE  INVERTED  //)  MAT00040 

READ(5>  9B8)N  MATB0B56 

900  FORMAT ( 12 )  MATB0060 

C  MATOBB70 

C  READ  IN  THE  ORDER  OF  THE  MATRIX  AND  COMPUTE  THE  MATRIX  MAT00080 

C  PRINT  OUT  THE  ORIGINAL  MATRIX  BY  ROMS  AFTER  COMPLETING  THE  INPUT  PHASEMAT00090 


C  MATB0100 

DO  901  I  =  1,N  MAT00110 

DO  901  J  =  l.N  MAT00120 

901  H(I,J)  =  1.0  /  FLOATU+J-l)  MAT00130 

C  MAT00140 

MRITEC  6 # 801 )  ((H(I,J),J=1,N),I=1,N)  MAT00150 

801  FORMATUH  , 10X.6E16 . 7//)  MAT00160 

C  MAT00170 

C  SELECT  THE  PIVOT  P  AND  SET  HU, I)  TO  1  IN  THE  OUTER  LOOP  MAT00180 

C  MAT00190 

DO  20  I  :  1«N  MAT00200 

P  =  H(I.I)  MAT00210 

H<I, I)  =  1.  MAT00220 

C  MAT00230 

C  DIVIDE  ROM  I  BY  THE  SELECTED  PIVOT  P  MAT00240 

C  MAT00250 

DO  30  J  =  l.N  MAT00260 

H(l.J)  =  H(I,J)  /P  MAT00270 

30  CONTINUE  MAT00280 

C  MAT00290 

C  MANIPULATE  ALL  ROUS  OTHER  THAN  I  TO  EFFECT  THE  ZEROEING  OF  ALL  MAT00300 

C  ELEMENTS  IN  THE  MATRIX  ABOVE  AND  BELOU  THE  SELECTED  PIVOT  MAT00310 

C  MAT00320 

DO  40  J  =  l.N  MAT00330 

IF  (I  . E9.  J)  GO  TO  40  MAT00340 

C  MAT00350 

C  IF  NE  J  THEN  SELECT  THE  ROM  MULTIPLIERS  9  AND  ZERO  H(I.J)  MAT00360 

C  MAT00370 

9  =  H(J,I)  MAT00380 

H(J.I)  =  0.  MAT00390 

C  MAT00400 

C  MANIPULATE  ALL  COLUMS  FOR  ROM  J  SO  THAT  THE  ELEMENT  UNDER  THE  MAT00410 

C  PIVOT  IS  ZEROED  MAT00420 

C  MAT00430 

DO  SO  K  =  l.N  MAT00440 

HCJ.KJ  *  H(J.K)  -  9KH(I,K>  MAT004SO 

50  CONTINUE  MAT00460 

40  CONTINUE  MAT00470 

20  CONTINUE  MAT00480 

C  MAT00490 

C  THE  ALGORITHM  IS  NOW  COMPLETED  AND  THE  INVERSE  IS  NOU  IN  THE  ORIGINAL  MAT00500 

C  MATRIX  H  MAT00510 

C  MAT00520 

C  MAT00530 

WRITEC6 .801 )  CCHCI. J),  J  =  l.N),  1=1, N)  MAT00540 

MAT00550 

END  MAT00560 


Program  POLY 


C  NOISE  LEVEL  FOR  POLYNOMIAL  WHOSE  ZEROS  ARE  FIRST  N  INTEGERS  POLOOOIO 

C  POL00020 

C  PROGRAM  EVALUATES  POLYNOMIAL  FOR  X  =  Z+JMH,  J=-M(1)M.  POLOOOJO 

C  THEN  COMPUTES  MEAN  AND  STANDARD  DEVIATION  OF  P-VALUES.  CARE  IS  POL00040 

C  NEEDED  TO  CHOOSE  H  SMALL  ENOUGH  SO  THAT  TRUE  P-VALUE  ESSENTIALLY  POL00050 

C  CONSTANT  IN  RANGE.  POL00060 

C  POL00070 

C  X.  INPUT  DEGREE  OF  POLYN.  AND  EVALUATE  COEFFTS.  POLOOOSO 

C  POL00090 

DIMENSION  A(20),P(21)  POLOOIOO 

10  WRITE  (4.20)  POLOOHO 

20  FORMAT! •  ENTER  N  -  DEGREE  OF  POLYN.  TO  STOP,  ENTER  N=0.’>  POL00120 

C  POL00130 

READ(S.M)  N  P0L00140 

IF  (N.EG.O)  STOP  POL00150 

A<2)  =  -X  POL00160 

A(l)  =  1  POL00170 

DO  40  K  -  2.N  POLOOISO 

A(K+1)  =  -KMA(K)  POLOOIOO 

DO  30  J  *  l.K  P0100200 

ACK-J+1)  =  A(K-J+1)  -  KNA(K-J)  POL00210 

30  CONTINUE  POL00220 

40  CONTINUE  POL00230 

MMM  =  N  ♦  1  POL00240 

PRINT  M, (Ad), 1=1, MMM)  POL00250 

50  WRITE  (4.60)  POL00260 

60  FORMAT  ('ENTER  Z.H.H  FOR  POLY.  EVAL.  M=0  TO  ALTER  POLY  DEG')  POL00270 

READ  (5.M)  Z.H.M  POL00280 

IF  (M.EO.O)  GO  TO  10  P0100290 

MMM  =  2*M+1  POL00300 

DO  80  J  =  l.MMM  POL00310 

X  =  Z-(M-J+1)«H  P0L00320 

PVAL  =  A( 1 )  P0100330 

DO  70  I  =1.N  PDL00340 

PVAL  =  PVALwX+Add)  P0L00350 

70  CONTINUE  P0L00360 

P(J)  =  PVAL  P0L00370 

80  CONTINUE  POL00380 

MMM  =  2*M+1  POL00390 

C  PRINT  M,  (P(J).Jsl.MMM)  P0L00400 

S  =  0.0  POL00410 

MMM  s  2*H+1  POL00420 

DO  90  K  =  l.MMM  POL00430 

S  =  S  ♦  P(K)  P0100440 

90  CONTINUE  POL00450 

PMEAN  =  S/(2*M*1)  POL00460 

VAR  =  0.0  POL00470 

MMM  =  2*M*1  POL00480 

DO  100  I  =  l.MMM  POL00490 

VAR  =  VAR+(P( I )  -  PMEAN) M»2  P0L00S00 

100  CONTINUE  POL00510 

RM  =  FLOAT(M)  POL00520 

PSDEV  =  S0RT(VAR/(2.MRM) )  POL00530 

PRINT  M, PMEAN. PSDEV  POL00540 

GO  TO  30  POL00550 

END  POL00560 
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Figure  2 
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6  Hilbert 

matrix 

Figure  3.  6*6  Hilbert  matrix 


Figure  4.  CDC  approximate  absolute  error  matrix 
(*  indicates  value  less  than  1.0  E-6) 
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Figure  5.  Harris  approximate  absolute  error  matrices 
(*  indicates  value  less  than  0.05) 
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b.  Double-precision 

**  Figure  6.  IBM  approximate  absolute  error  matrices 
(•  indicates  value  less  than  0.05) 
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results  have  been  rounded  to  1  decimal  place). 

61.  The  CDC,  the  baseline  system,  has  a  very  small  absolute  error 
matrix.  The  Harris  is  the  only  system  with  a  small  error  matrix  in  single- 

••  precision.  The  double-precision  approximate  error  matrices  for  the  PRIME, 

**  IBM,  and  VAX  were  smaller  than  that  for  the  Harris  in  quadruple-precision. 

Real  World  Problems 

62.  For  the  S&E  problems  in  this  study,  three  computer  programs  were 
selected  that  have  been  in  use  for  several  years  and  have  been  thoroughly 
validated.  These  programs  solve  certain  types  of  soil-structure  interaction 
problems  that  are  complex  and  for  which  no  simple  closed-form  solution  exist. 
Often  the  algorithm  employed  in  the  solution  of  a  soil-structure  interaction 
problem  will  yield  incorrect  results  if  the  number  of  significant  figures 
carried  in  the  calculation  is  insufficient. 

63.  Three  problems  were  chosen  to  show  that  different  results  can  be 
obtained  when  Identical  problems  are  run  on  each  of  the  systems  tested  using 
the  same  program.  The  problems  have  already  been  used  in  a  previous  study 
(O’Neil  and  Peterson  1976)  with  the  same  programs  on  a  CDC  system.  The  re¬ 
ported  solutions  from  the  CDC  were  duplicated  in  this  study.  The  conclusions 
from  the  previous  study  indicated  that  the  CDC's  solutions  were  acceptable; 
therefore,  these  solutions  are  used  for  comparison  with  the  results  from  the 
systems  tested. 

Case  1 — flexible  sheet  pile  bulkhead  in  sand 

64.  The  deflections  and  moments  produced  in  a  flexible  sheet  pile  bulk 
head  embedded  in  sand  are  to  be  computed  for  the  physical  system  depicted  in 
Figure  9.  The  problem  was  solved  by  using  a  computer  code  (BMCOL  28)  which 
establishes  the  finite  difference  equations  for  a  beam  on  an  elastic  subgrade 
at  predesignated  equally  spaced  nodes  along  the  bulkhead  (Matlock  and  Ingram 
1963).  The  system  of  linear  difference  equations  so  generated  forms  a  matrix 
equation  in  which  the  stiffness  matrix  is  tightly  banded.  Recursive  tech¬ 
niques  are  used  to  solve  for  the  deflection  at  each  node,  and  moments,  shears 
and  soil  reactions  are  subsequently  computed. 

65.  It  is  common  practice  to  analyze  problems  like  that  shown  in  Fig¬ 
ure  9  by  utilizing  50  to  100  nodes,  but  such  solutions  may  be  relatively 
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